Successful sales force management depends on up-to-date, accurate information. With appropriate, easy access to business intelligence, a Sales Director and Sales Managers can monitor goals and objectives. But that’s not all a business intelligence tool can do for a sales team. In today’s competitive market, marketing, advertising and sales teams cannot afford to wait and be outstripped by the competition. They must begin to court and engage a customer before the customer has the need for an item.

By building brand awareness and improving product and service visibility, the sales team can work seamlessly across the marketing and sales channel to educate and enlighten prospects and then carry them through the process to close the deal. To do that, the sales staff must have a comprehensive understanding of buying behaviors, current issues with existing products, pricing points and the impact of changing prices, products or distribution channels.

With access to data integrated from CRM, ERP, warehousing, supply chain management, and other functions and data sources, a sales manager and sales team can create personalized business intelligence dashboards to guide them through the process and to help them analyze and understand trends and patterns before the competition strikes.

The enterprise must monitor sales results at the international, national, regional, local, team and individual sales professional levels. As a sales manager, you should be able to manage incentives and set targets with complete confidence, and provide accurate sales forecasts and predictions so that the enterprise consistently meets its goals and can depend on the predicted revenue and profits for investment, new product development, market expansion and resource acquisition.

Business Intelligence for the sales function must include Key Performance Indicators (KPI) to help the team manage each role and be accountable for objectives and goals. If a sales region fails to meet the established plan, the business can quickly ascertain the root cause of the issue, whether it is product dissatisfaction, poor sales performance, or any one of a number of other sources.

Since the demand generated by the sales force directly affects the production cycle and plan, the sales team must monitor sales targets and objectives against product capacity and production to ensure that they can satisfy the customer without shortfalls or back orders. If some customers are behind on product payments, a business must be able to identify the source of the issue and address it before it results in decreased revenue and results.

The ten benefits listed below comprise a set of ‘must haves’ for every sales team considering a business intelligence solution:

  1. Set targets and allocate resources based on authentic data, rather than speculation
  2. Establish, monitor and adapt accurate forecasts and budgets based on up-to-date, verified data and objective KPIs
  3. Analyze current data, and possible cross-sell and up-sell revenue paths and the estimated lifetime value of a customer
  4. Analyze the elements of sales efforts (prospecting, up-selling, discounts, channel partners, sales collaterals, presentations) and adapt processes that do not provide a competitive edge and strong customer relationships and client loyalty
  5. Measure the factors affecting sales effectiveness to improve sales productivity and correct strategies that do not work
  6. Achieve a consistent view of sales force performance, with a clear picture of unexpected variations in sales and immediate corrective action and strategic adjustment based on trends and patterns
  7. Understand product profitability and customer behavior, by spotlighting customers and products with the highest contribution to the bottom line
  8. Revise expense and resource allocation using the net value of each customer segment or product group
  9. Identify the most effective sales tactics and mechanisms, and the best resources and tools, to meet organizational sales objectives
  10. Establish a personalized, automated alert system to identify and monitor upcoming opportunities and threats

When the enterprise provides a single source, integrated view of enterprise data from numerous sources and enables every user to build views, dashboards and KPIs, every member of the sales team is engaged in the pursuit of strategic, operational and tactical goals. In this way, the enterprise can acquire new clients, retain existing clients, and sell new products and services without a misstep.

Read more…

Taxonomy of 3D DataViz

I've been trying to pull together a taxonomy of 3D data viz. The biggest difference, I think, is between allocentric (the data moves) and egocentric (you move) viewpoints. Whether you then view/explore the egocentric 3D visualisation on a 2D screen or in a 3D headset is, I think, a lesser distinction (and an HMD is possibly less practical in most cases).

We have a related benefits escalator for 2D->3D dataviz, but again I'm not convinced that "VR" should represent another level on this - it's more of an orthogonal element - another way to view the upper tiers.

Care to discuss or extend/expand/improve?

Read more…

Guest blog post by SupStat

Contributed by Sharan Duggal.  You can find the original article here.

Introduction


We know that war and civil unrest account for a significant proportion of deaths every year, but how much of the mortality rate can be attributed to a simple lack of basic resources and amenities, and what relationship do mortality rates have with such factors? That’s what I set out to uncover using WorldBank data that covers the globe for roughly the last 50 years, and I found a strong relationship with some of the available data.

If you were to look at overall mortality rates, the numbers would be muddied by several factors, including the aforementioned causes of death, so I decided to look at two related, but more specific outcome variables – infant mortality as well as risk of maternal death.

Infant mortality is defined as the number of infants dying before reaching one year of age, per 1,000 live births in a given year.

Lifetime risk of maternal death is the probability that a 15-year-old female will die eventually from a maternal cause assuming that current levels of fertility and mortality (including maternal mortality) do not change in the future, taking into account competing causes of death.

While I am sure these numbers can also be affected by things like civil unrest, they focus on individuals who are arguably more likely to be affected by communicable diseases and by a lack of basic provisions like clean water, electricity or adequate medical resources, among others.

So, what do overall mortality rates even look like?

The density plot below includes the overall infant mortality distribution along with some metrics indicating the availability of key resources. Infant mortality rates peak at around 1% and the availability of resources peaks closer to 100%. In both cases we see really long tails, indicating that there is a portion of the population experiencing less than ideal numbers.

So to drill down further, let’s have a closer look at the distribution of both outcome variables by year. The boxplots below suggest that both infant mortality rates and risk of maternal death have shown not only steady overall improvements over the years but also a reduction in the disparity of cases across country-specific observations. But the upper end of these distributions still represents shocking numbers for some countries: over 10% of infants dying every year (down from a high of 24% in 1961), and a 7.5% probability that a 15-year-old girl living today will eventually die of a maternal cause (down from over 15% twenty-five years ago).

Please note: points have been marginally jittered above for clearer visual representation
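For readers who want to reproduce this kind of view, here is a minimal sketch in R with ggplot2, assuming a long-format data frame called wb with one row per country and year and columns named year and infant_mortality (these names are placeholders, not the actual WorldBank field names):

library(ggplot2)

# Boxplots of infant mortality by year, with lightly jittered points overlaid
ggplot(wb, aes(x = factor(year), y = infant_mortality)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(width = 0.2, alpha = 0.2, size = 0.5) +
  labs(x = "Year", y = "Infant mortality (per 1,000 live births)")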

Mortality Rates across the Globe


The below map plots the 2012 distribution of infant mortality rates by country. I chose 2012 because most of the covariates I would eventually like to use contain the best information from this year, with a couple of exceptions. It also presents a relatively recent picture of the variables of interest.

As can be seen, the world is distinctly divided, with many African, and some South Asian, countries bearing a bigger burden of infant mortality. And if it wasn’t noticeable on the previous boxplot, the range of values, as shown in the scale below, is particularly telling of the overall disparity of mortality rates, pointing to a severe imbalance across the world.

The map representing the risk of maternal death is almost identical, and as such has been represented in a different color for differentiation. Here, the values range from close to 0% to over 7%.

Bottom Ranked Countries Over the Years


After factoring in all 50+ years of data for infant mortality and 26 years of data for risk of maternal death, and then ranking countries, the same set of countries feature at the bottom of the list.

The below chart looks at the number of times a country has had one of the worst three infant mortality rates in any given year since 1960.

The chart for maternal data goes from 1990 through to 2015. It’s important to note that Chad and Sierra Leone were ranked in the bottom 3 for maternal risk of death in every year since 1990.

Please note that numbers may be slightly impacted by missing data for some countries, especially for earlier years in the data set.
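As a sketch of how such a ranking could be computed (my own illustration, not the original code from the analysis), assuming the same hypothetical wb data frame with columns country, year and infant_mortality:

library(dplyr)

# For each year, rank countries from worst to best infant mortality,
# keep the worst three, and count how often each country appears.
bottom3_counts <- wb %>%
  filter(!is.na(infant_mortality)) %>%
  group_by(year) %>%
  mutate(rank_worst = min_rank(desc(infant_mortality))) %>%
  ungroup() %>%
  filter(rank_worst <= 3) %>%
  count(country, sort = TRUE)
bottom3_counts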

Relationship between Mortality & Resources


Getting back to the original question: are there any low-hanging fruit and easy fixes for such a dichotomous situation? While my efforts during this analysis did not include any regressions, I did want to get an initial understanding of whether the availability of basic resources had a strong association with mortality rates and, if such a relationship existed, which provisions were most strongly linked with these outcomes. The findings could serve as a platform for further research.

The below correlation analysis helped home in on some of the stronger linkages and helped weed out some of the weaker ones.

Note: the correlation analysis was run using 2012 data for all metrics, except for “Nurses and Midwives (per 1000 people)” and “Hospital beds (per 1000 people)”, for which 2010 and 2009 data were used respectively, due to poorer availability of 2012 data for these measures.
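One possible way to produce such a correlation plot in R is sketched below, under the assumption of a wide data frame metrics_2012 with one row per country and numeric columns for the two outcomes and the candidate predictors; the corrplot package is my choice here, not necessarily the one used in the original analysis:

library(dplyr)
library(corrplot)

# Pairwise correlations across countries, tolerating missing values
cor_mat <- metrics_2012 %>%
  select(-country) %>%
  cor(use = "pairwise.complete.obs")

corrplot(cor_mat, method = "circle", type = "lower", tl.cex = 0.7)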

 

Focusing on the first two columns of the above correlation plot, which represent risk of maternal death and infant mortality, we see a very similar pattern across the variables included in the analysis. Besides basic resources, I had also included items like availability of renewable freshwater resources and land area, to see if naturally available resources had any linkage to the outcomes in question. They didn’t, and so they were removed from the analysis. In the plot above, it can also be seen that average rainfall and population density don’t have much of a relationship with the mortality rates in question. It was also surprising that access to anti-retroviral therapy had a weak correlation with mortality rates in general.

The metrics that had the strongest relationship (in the 0.75 to 0.85 range) were:

  • Percent of population with electricity
  • Percent of population with access to non-solid fuel
  • Percent of population with access to improved sanitation facilities, and
  • Percent of population with access to improved water sources


The first two require no definitional explanation, but access to improved sanitation facilities ensures the hygienic separation of human excreta from human contact. Access to improved water sources refers to the percentage of the population using an improved drinking water source, including piped water on premises and other improved sources (public taps or standpipes, tube wells or boreholes, protected dug wells, protected springs, and rainwater collection).

Analyzing the strongly correlating factors by Region


The following four charts look at regional performance on the key identified metrics. The pattern is the same as that seen on the static world map from 2012, but this also gives us a view into how the resources that seem to be strongly linked with infant and maternal mortality have been trending over the past 25 years. We see a fairly shallow slope for Sub-Saharan Africa on access to non-solid fuel as well as on improved sanitation facilities. Improvements in drinking water access have been much better.

South Asian countries ranked lowest on the provision of sanitation facilities in the early ’90s, but have made improvements since.

Conclusion


My analysis found a very strong relationship between mortality rates and basic provisions. It also weeded out some factors which were less important. As a next step, it may be helpful to do a deeper country-specific analysis for African and South Asian nations that suffer from a chronic lack of basic infrastructure, to see where investments would be most fruitful in bringing these countries to a closer state of parity with the developed world.

Read more…

Originally posted on Data Science Central

“I keep saying that the sexy job in the next 10 years will be statisticians, and I’m not kidding”

Hal Varian, Chief Economist at Google and emeritus professor at the University of California, Berkeley, said this on the 5th of August 2009.

Today, what Hal Varian said almost seven years ago has been confirmed, as is highlighted in the following graph taken from Google Trends, which gives a good idea of the current attention paid to the figure of the Data Scientist.

The Observatory for Big Data Analytics & BI of Politecnico di Milano has been working on the theme of Data Scientists for a few years, and has now prepared a survey for Data Scientists that will be used to build a picture of the Data Scientist’s role within their company and the context in which they operate.

If you work with data in your company, please support us in our research and take this totally anonymous survey here. Thank you from the Observatory for Big Data Analytics & BI.

 

Graph 1: How many times the term "Data Scientist" has been searched on Google. The numbers in the graph represent the search volume for the term relative to the highest point in the graph: the point with the maximum number of searches is given the value 100, and the other values are proportional.

Mike Loukides, VP of O’Reilly Media, summarized the Data Scientist’s job description in these words:

"Data scientists are involved with gathering data, massaging it into a tractable form, making it tell its story, and presenting that story to others."

We are in the era of Big Data, an era in which 2.5 quintillion (10^18) bytes are generated every day. Both the private and public sectors everywhere are adapting so that they can exploit the potential of Big Data by introducing into their organizations people who are able to extract information from data.

Getting information out of data is of increasing importance because of the huge amount of data available. As Daniel Keys Moran, programmer and science fiction writer, said: “You can have data without information, but you cannot have information without data”.

In companies today, we are seeing positions like the CDO (Chief Data Officer) and Data Scientists more often than we used to.

The CDO is a business leader, typically a member of the organization’s executive management team, who defines and executes analytics strategies. This is the person actually responsible for defining and developing the strategies that will direct the company’s processes of data acquisition, data management, data analysis and data governance. This means that new governance roles and new professional figures have been introduced in many organizations to exploit what Big Data offer them in terms of opportunities.

According to the report on “Big Success with Big Data” (Accenture, 2014), 89% of companies believe that, without a big data analytics strategy, in 2015 they risk losing market share and will no longer be competitive.

Collecting data is not simply retrieving information: the Data Scientists’ role is to translate data into information, and currently there is a dearth of people with this set of skills.

It may seem controversial, but both companies and Data Scientists know very little about what skills are needed. They are operating in a turbulent environment where frequent monitoring is needed to know who actually uses which tools, which tools are considered old and becoming obsolete, and which are those used by the highest and lowest earners. According to a study by RJMetrics (2015), the Top 20 Skills of a Data Scientist are those contained in the following graph. 

The graph clearly shows the importance of tools and programming languages such as R and Python. Machine Learning, Data Mining and Statistics are also high up in the set of most requested skills. Those relating to Big Data are at about the 15th place.

The most recent research on Data Scientists showed that these professionals are more likely to be found in companies belonging to the ICT sector, internet companies and software vendors such as Microsoft and IBM, rather than in social networks (Facebook, LinkedIn, Twitter), Airbnb, Netflix, etc. The following graph, provided – like the previous one – by RJMetrics, gives the proportion of Data Scientists by industry.

It is important to keep monitoring Data Scientists across industrial sectors, their diffusion and their main features, because, in the unsettled business world of today, we can certainly expect a great many changes to take place as companies become aware, at different times and in different ways, of the importance of Data Scientists.

Read more…

Guest blog post by Kevin Smith

I teach AP Statistics in China at an International school and I believe it's important to not only show my students how to do plots and inferential statistics on their TI Nspire calculators, but also in R using ggplot, dplyr, and R Markdown.

We are starting the third unit in AP Statistics and we will be learning about scatter plots and regression. I will teach them how to do this in R and use R Markdown to export to Word.

I have already gone over some of the basics of opening RStudio and entering some data and saving to their home directory. We have R and RStudio on all forty of our school computers. They are also required to install R and RStudio on their home computer. I’ll keep the online Microsoft Data Scientist Workbench as a backup.

Here are some ggplot basics that I’ll start with.

I’ll use examples from our AP stats book and the IB book. We are using The Practice of Statistics 4th edition  by Starnes, Yates and Moore (TPS4e) for AP Statistics class. I want to recreate some of the plots in the textbook so I can teach my students how they can create these same plots. We can probably improve in some way on these plots and at the same time, teach them the basics of regression and R programming.

Here is my general plan:

  • Enter the data into the TI nspire cx.
  • Generate a scatter plot on the TI.
  • Use the Smartboard to show the code in R using RStudio.
  • On the first day use an R Script for the R code.
  • All following days, use R Markdown to create and annotate the scatter plots.
  • Publish to our Moodle page or maybe saturnscience website.

Making a scatter plot

Now let’s make a scatter plot with the example in the TPS4e book Chapter 3, page 145.

The general form of a ggplot command will look like this:

myGraph <- ggplot(myData, aes(variable for x axis, variable for y axis)) + geom()

Here is the data from page 145 in the TPS 4e textbook and how we enter it in. We use the “c” command to combine or concatenate into a vector. We then turn these two vectors into a data frame.

body.wt=c(120,187,109,103,131,165,158,116)  
backpack.wt=c(26,30,26,24,29,35,31,28)
TPS145 = data.frame(body.wt, backpack.wt)
TPS145

Now we put this data frame into the ggplot object and name it scatter145 and call the ggplot2 package.
library(ggplot2)
scatter145 = ggplot(data = TPS145, aes(body.wt, backpack.wt)) +
  geom_point()

Here is the scatter plot below produced from the above code:

This is a starting point and we can add to this plot to really spruce it up.

I added some blue color to the plot based on the body weight.

scatter145=ggplot(data=TPS145, aes(body.wt,backpack.wt,colour=body.wt)) + 
geom_point()

scatter145




Adding Labels And Adjusting The Size Of The Data Point

To add the x, y and main title labels, I add on to my plot with xlab(), ylab(), and ggtitle(). I also increased the size of the plotted points to make them easier to see.

scatter145 = scatter145+ geom_point(size=2) +     
xlab("Body Weight (lb)") +
ylab("Pack weight (lb)") +
ggtitle("Backpack Weight")

scatter145

How To Add The Regression Line.

I will keep adding to the plot by plotting the regression line. The function for adding a linear model is “lm”. The gray shaded area is the 95% confidence interval.

Here is the final code for creating the scatter plot with the regression line.

  scatter145=scatter145+ geom_point(size=3) +    
xlab("Body Weight (lb)") +
ylab("Pack weight (lb)")+
ggtitle("Backpack Weight")+
geom_smooth(method = "lm")

Here is the scatter plot with the regression line.



My motivation for working in R Markdown is that I want to teach my students that R Markdown is an excellent way to integrate their R code, writing, plots and output. This is the way of the near future in Introductory Statistics. I also want to model how reproducible research should be done.
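As a concrete illustration (my own minimal sketch, not taken from the TPS4e materials), a student's R Markdown document for the backpack example could be as small as this, knitting straight to Word:

---
title: "Backpack Weight vs Body Weight"
output: word_document
---

```{r backpack-scatter}
library(ggplot2)
body.wt <- c(120, 187, 109, 103, 131, 165, 158, 116)
backpack.wt <- c(26, 30, 26, 24, 29, 35, 31, 28)
TPS145 <- data.frame(body.wt, backpack.wt)
ggplot(TPS145, aes(body.wt, backpack.wt)) +
  geom_point(size = 3) +
  geom_smooth(method = "lm") +
  xlab("Body Weight (lb)") +
  ylab("Pack weight (lb)") +
  ggtitle("Backpack Weight")
```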

Two research papers I read recently support this view.

Some Recent Research On Reproducible Research And Intro Statistics

In their paper Teaching and Learning Data Visualization: Ideas and Assignments (available here), Deborah Nolan and Jamis Perrett argue that statistical graphics should have a more prominent role in an introductory statistics course.

This article discusses how to make statistical graphics a more prominent element of the undergraduate statistics curricula. The focus is on several different types of assignments that exemplify how to incorporate graphics into a course in a pedagogically meaningful way. These assignments include having students deconstruct and reconstruct plots, copy masterful graphs, create one-minute visual revelations, convert tables into `pictures’, and develop interactive visualizations with, e.g., the virtual earth as a plotting canvas.

In another paper, R Markdown: Integrating a Reproducible Analysis Tool into Introductory Statistics, Ben Baumer, Mine Cetinkaya-Rundel, Andrew Bray, Linda Loi and Nicholas J. Horton argue that teaching students R Markdown helps them grasp the concept of reproducible research.

R Markdown is a new technology that makes creating fully-reproducible statistical analysis simple and painless. It provides a solution suitable not only for cutting edge research, but also for use in an introductory statistics course. We present evidence that R Markdown can be used effectively in introductory statistics courses, and discuss its role in the rapidly-changing world of statistical computation.

 
 
Read more…

R for SQListas (1): Welcome to the Tidyverse

Guest blog post by Sigrid Keydana

R for SQListas, what's that about?

This is the two-part blog version of a talk I gave at DOAG Conference this week. I've also uploaded the slides (no ppt; just a pretty R presentation ;-) ) to the articles section, but if you'd like a little text I encourage you to read on. That is, if you're in the target group for this post/talk.
For this post, let me assume you're a SQL girl (or guy). With SQL you're comfortable (an expert, probably): you know how to get and manipulate your data, and no nesting of subselects has you scared ;-). And now there's this R language people are talking about, and it can do so many things they say, so you'd like to make use of it too - so does this mean you have to start from scratch and learn not only a new language, but a whole new paradigm? Turns out, not really. So that's the context for this post.

Let’s talk about the weather

So in this post, I'd like to show you how nice R is to use if you come from SQL. But this isn't going to be a syntax-only post. We'll be looking at real datasets and trying to answer a real question.
Personally I’m very interested in how the weather is going to develop in the future, especially in the nearer future, and especially in the area where I live (I know. It’s egocentric.). Specifically, what worries me are warm winters, and I'll be clutching at any straw that tells me it's not going to get warmer still ;-)
So I’ve downloaded / prepared two datasets, both climate / weather-related. The first is the average global temperatures dataset from the Berkeley Earth Surface Temperature Study, nicely packaged by Kaggle (a website for data science competitions; https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data). This contains measurements from 1743 on, up till 2013. The monthly averages have been obtained using sophisticated scientific procedures available on the Berkeley Earth website (http://berkeleyearth.org/).
The second is daily weather data for Munich, obtained from www.wunderground.com. This dataset was retrieved manually, and the period was chosen so as to not contain too many missing values. The measurements range from 1997 to 2015, and have been aggregated by taking a monthly average.
Let’s start our journey through R land, reading in and looking at the beginning of the first dataset:

library(tidyverse)
library(lubridate)
df <- read_csv('data/GlobalLandTemperaturesByCity.csv')
head(df)

## 1 1743-11-01 6.068 1.737 Århus
## 2 1743-12-01 NA NA Århus
## 3 1744-01-01 NA NA Århus
## 4 1744-02-01 NA NA Århus
## 5 1744-03-01 NA NA Århus
## 6 1744-04-01 5.788 3.624 Århus
## # ... with 3 more variables: Country , Latitude ,
## # Longitude

Now we’d like to explore the dataset. With SQL, this is easy: We use WHERE to filter rows, SELECT to select columns, GROUP BY to aggregate by one or more variables...And of course, we often need to JOIN tables, and sometimes, perform set operations. Then there’s all kinds of analytic functions, such as LAG() and LEAD(). How do we do all this in R?

Entering the tidyverse

Luckily for the SQLista, writing elegant, functional, and often rather SQL-like code in R is easy. All we need to do is ... enter the tidyverse. Actually, we’ve already entered it – doing library(tidyverse) – and used it to read in our csv file (read_csv)!
The tidyverse is a set of packages, developed by Hadley Wickham, Chief Scientist at RStudio, designed to make working with R easier and more consistent (and more fun). We load data from files using readr, clean up datasets that are not in third normal form using tidyr, manipulate data with dplyr, and plot them with ggplot2.
For our task of data exploration, it is dplyr we need. Before we even begin, let’s rename the columns so they have shorter names:

df <- rename(df,
             avg_temp = AverageTemperature,
             avg_temp_95p = AverageTemperatureUncertainty,
             city = City,
             country = Country,
             lat = Latitude,
             long = Longitude)
head(df)

## # A tibble: 6 × 7
## dt avg_temp avg_temp_95p city country lat long
##
## 1 1743-11-01 6.068 1.737 Århus Denmark 57.05N 10.33E
## 2 1743-12-01 NA NA Århus Denmark 57.05N 10.33E
## 3 1744-01-01 NA NA Århus Denmark 57.05N 10.33E
## 4 1744-02-01 NA NA Århus Denmark 57.05N 10.33E
## 5 1744-03-01 NA NA Århus Denmark 57.05N 10.33E
## 6 1744-04-01 5.788 3.624 Århus Denmark 57.05N 10.33E

distinct() (SELECT DISTINCT)

Good. Now that we have this new dataset containing temperature measurements, really the first thing we want to know is: What locations (countries, cities) do we have measurements for?
To find out, just do distinct():

distinct(df, country)

## # A tibble: 159 × 1
## country
##
## 1 Denmark
## 2 Turkey
## 3 Kazakhstan
## 4 China
## 5 Spain
## 6 Germany
## 7 Nigeria
## 8 Iran
## 9 Russia
## 10 Canada
## # ... with 149 more rows

distinct(df, city)

## # A tibble: 3,448 × 1
## city
##
## 1 Århus
## 2 Çorlu
## 3 Çorum
## 4 Öskemen
## 5 Ürümqi
## 6 A Coruña
## 7 Aachen
## 8 Aalborg
## 9 Aba
## 10 Abadan
## # ... with 3,438 more rows

filter() (WHERE)

OK. Now as I said I'm really first and foremost curious about measurements from Munich, so I'll have to restrict the rows. In SQL I'd need a WHERE clause, in R the equivalent is filter():

filter(df, city == 'Munich')
## # A tibble: 3,239 × 7
## dt avg_temp avg_temp_95p city country lat long
##
## 1 1743-11-01 1.323 1.783 Munich Germany 47.42N 10.66E
## 2 1743-12-01 NA NA Munich Germany 47.42N 10.66E
## 3 1744-01-01 NA NA Munich Germany 47.42N 10.66E
## 4 1744-02-01 NA NA Munich Germany 47.42N 10.66E
## 5 1744-03-01 NA NA Munich Germany 47.42N 10.66E
## 6 1744-04-01 5.498 2.267 Munich Germany 47.42N 10.66E
## 7 1744-05-01 7.918 1.603 Munich Germany 47.42N 10.66E

This is how we combine conditions if we have more than one of them in a where clause:

# AND
filter(df, city == 'Munich', year(dt) > 2000)
## # A tibble: 153 × 7
## dt avg_temp avg_temp_95p city country lat long
##
## 1 2001-01-01 -3.162 0.396 Munich Germany 47.42N 10.66E
## 2 2001-02-01 -1.221 0.755 Munich Germany 47.42N 10.66E
## 3 2001-03-01 3.165 0.512 Munich Germany 47.42N 10.66E
## 4 2001-04-01 3.132 0.329 Munich Germany 47.42N 10.66E
## 5 2001-05-01 11.961 0.150 Munich Germany 47.42N 10.66E
## 6 2001-06-01 11.468 0.377 Munich Germany 47.42N 10.66E
## 7 2001-07-01 15.037 0.316 Munich Germany 47.42N 10.66E
## 8 2001-08-01 15.761 0.325 Munich Germany 47.42N 10.66E
## 9 2001-09-01 7.897 0.420 Munich Germany 47.42N 10.66E
## 10 2001-10-01 9.361 0.252 Munich Germany 47.42N 10.66E
## # ... with 143 more rows

# OR
filter(df, city == 'Munich' | year(dt) > 2000)

## # A tibble: 540,116 × 7
## dt avg_temp avg_temp_95p city country lat long
##
## 1 2001-01-01 1.918 0.381 Århus Denmark 57.05N 10.33E
## 2 2001-02-01 0.241 0.328 Århus Denmark 57.05N 10.33E
## 3 2001-03-01 1.310 0.236 Århus Denmark 57.05N 10.33E
## 4 2001-04-01 5.890 0.158 Århus Denmark 57.05N 10.33E
## 5 2001-05-01 12.016 0.351 Århus Denmark 57.05N 10.33E
## 6 2001-06-01 13.944 0.352 Århus Denmark 57.05N 10.33E
## 7 2001-07-01 18.453 0.367 Århus Denmark 57.05N 10.33E
## 8 2001-08-01 17.396 0.287 Århus Denmark 57.05N 10.33E
## 9 2001-09-01 13.206 0.207 Århus Denmark 57.05N 10.33E
## 10 2001-10-01 11.732 0.200 Århus Denmark 57.05N 10.33E
## # ... with 540,106 more rows

select() (SELECT)

Now, often we don't want to see all the columns/variables. In SQL we SELECT what we're interested in, and it's select() in R, too:
select(filter(df, city == 'Munich'), avg_temp, avg_temp_95p)

## # A tibble: 3,239 × 2
## avg_temp avg_temp_95p
##
## 1 1.323 1.783
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 5.498 2.267
## 7 7.918 1.603
## 8 11.070 1.584
## 9 12.935 1.653
## 10 NA NA
## # ... with 3,229 more rows

arrange() (ORDER BY)

How about ordered output? This can be done using arrange():

arrange(select(filter(df, city == 'Munich'), dt, avg_temp), avg_temp)

## # A tibble: 3,239 × 2
## dt avg_temp
##
## 1 1956-02-01 -12.008
## 2 1830-01-01 -11.510
## 3 1767-01-01 -11.384
## 4 1929-02-01 -11.168
## 5 1795-01-01 -11.019
## 6 1942-01-01 -10.785
## 7 1940-01-01 -10.643
## 8 1895-02-01 -10.551
## 9 1755-01-01 -10.458
## 10 1893-01-01 -10.381
## # ... with 3,229 more rows

Do you think this is starting to get difficult to read? What if we add FILTER and GROUP BY operations to this query? Fortunately, with dplyr it is possible to avoid paren hell as well as stepwise assignment using the pipe operator, %>%.

Meet: %>% - the pipe

The pipe transforms an expression of the form x %>% f(y) into f(x, y) and so allows us to write the above operation like this:

df %>% filter(city == 'Munich') %>% select(dt, avg_temp) %>% arrange(avg_temp)

This looks a lot like the fluent API design popular in some object oriented languages, or the bind operator, >>=, in Haskell.
It also looks a lot more like SQL. However, keep in mind that while SQL is declarative, the order of operations matters when you use the pipe (as the name says, the output of one operation is piped to another). You cannot, for example, write this (trying to emulate SQL's SELECT – WHERE – ORDER BY): df %>% select(dt, avg_temp) %>% filter(city == 'Munich') %>% arrange(avg_temp). This can’t work because after a new dataframe has been returned from the select, the column city is no longer available.

group_by() and summarise() (GROUP BY)

Now that we’ve introduced the pipe, on to group by. This is achieved in dplyr using group_by() (for grouping, obviously) and summarise() for aggregation.
Let’s find the countries we have most – and least, respectively – records for:

# most records
df %>% group_by(country) %>% summarise(count=n()) %>% arrange(count %>% desc())

## # A tibble: 159 × 2
## country count
##
## 1 India 1014906
## 2 China 827802
## 3 United States 687289
## 4 Brazil 475580
## 5 Russia 461234
## 6 Japan 358669
## 7 Indonesia 323255
## 8 Germany 262359
## 9 United Kingdom 220252
## 10 Mexico 209560
## # ... with 149 more rows

# least records
df %>% group_by(country) %>% summarise(count=n()) %>% arrange(count)

## # A tibble: 159 × 2
## country count
##
## 1 Papua New Guinea 1581
## 2 Oman 1653
## 3 Djibouti 1797
## 4 Eritrea 1797
## 5 Botswana 1881
## 6 Lesotho 1881
## 7 Namibia 1881
## 8 Swaziland 1881
## 9 Central African Republic 1893
## 10 Congo 1893

How about finding the average, minimum and maximum temperatures per month, looking at just records from Germany, and that originate after 1949?

df %>% filter(country == 'Germany', !is.na(avg_temp), year(dt) > 1949) %>% group_by(month(dt)) %>% summarise(count = n(), avg = mean(avg_temp), min = min(avg_temp), max = max(avg_temp))

## # A tibble: 12 × 5
## `month(dt)` count avg min max
##
## 1 1 5184 0.3329331 -10.256 6.070
## 2 2 5184 1.1155843 -12.008 7.233
## 3 3 5184 4.5513194 -3.846 8.718
## 4 4 5184 8.2728137 1.122 13.754
## 5 5 5184 12.9169965 5.601 16.602
## 6 6 5184 15.9862500 9.824 21.631
## 7 7 5184 17.8328285 11.697 23.795
## 8 8 5184 17.4978752 11.390 23.111
## 9 9 5103 14.0571383 7.233 18.444
## 10 10 5103 9.4110645 0.759 13.857
## 11 11 5103 4.6673114 -2.601 9.127
## 12 12 5103 1.3649677 -8.483 6.217

In this way, aggregation queries can be written that are powerful and very readable at the same time. So at this point, we know how to do basic selects with filtering and grouping. How about joins?

JOINs

Dplyr provides inner_join(), left_join(), right_join() and full_join() operations, as well as semi_join() and anti_join(). From the SQL viewpoint, these work exactly as expected.
To demonstrate a join, we’ll now load the second dataset, containing daily weather data for Munich, and aggregate it by month:

# daily_1997_2015 holds the daily Munich weather data, read in beforehand
monthly_1997_2015 <- daily_1997_2015 %>%
  group_by(month = floor_date(day, "month")) %>%
  summarise(mean_temp = mean(mean_temp))
monthly_1997_2015

## # A tibble: 228 × 2
## month mean_temp
##
## 1 1997-01-01 -3.580645
## 2 1997-02-01 3.392857
## 3 1997-03-01 6.064516
## 4 1997-04-01 6.033333
## 5 1997-05-01 13.064516
## 6 1997-06-01 15.766667
## 7 1997-07-01 16.935484
## 8 1997-08-01 18.290323
## 9 1997-09-01 13.533333
## 10 1997-10-01 7.516129
## # ... with 218 more rows

Fine. Now let’s join the two datasets on the date column (their respective keys), telling R that this column is named dt in one dataframe, month in the other:

df <- df %>% select(dt, avg_temp) %>% filter(year(dt) > 1949)
df %>% inner_join(monthly_1997_2015, by = c("dt" = "month"))

## # A tibble: 705,510 × 3
## dt avg_temp mean_temp
##
## 1 1997-01-01 -0.742 -3.580645
## 2 1997-02-01 2.771 3.392857
## 3 1997-03-01 4.089 6.064516
## 4 1997-04-01 5.984 6.033333
## 5 1997-05-01 10.408 13.064516
## 6 1997-06-01 16.208 15.766667
## 7 1997-07-01 18.919 16.935484
## 8 1997-08-01 20.883 18.290323
## 9 1997-09-01 13.920 13.533333
## 10 1997-10-01 7.711 7.516129
## # ... with 705,500 more rows

As we see, the average temperatures obtained for the same month differ a lot from each other. Evidently, the methods of averaging used (by us and by Berkeley Earth) were very different. We will have to use each dataset separately for exploration and inference.

Set operations

Having looked at joins, on to set operations. The set operations known from SQL can be performed using dplyr’s intersect(), union(), and setdiff() methods. For example, let’s combine the Munich weather data from before 2016 and from 2016 in one data frame:

# daily_2016 holds the daily Munich weather data for 2016, read in the same way
union(daily_1997_2015, daily_2016) %>% arrange(day)

## # A tibble: 7,195 × 23
## day max_temp mean_temp min_temp dew mean_dew min_dew max_hum
##
## 1 1997-01-01 -8 -12 -16 -13 -14 -17 92
## 2 1997-01-02 0 -8 -16 -9 -13 -18 92
## 3 1997-01-03 -4 -6 -7 -6 -8 -9 93
## 4 1997-01-04 -3 -4 -5 -5 -6 -6 93
## 5 1997-01-05 -1 -3 -6 -4 -5 -7 100
## 6 1997-01-06 -2 -3 -4 -4 -5 -6 93
## 7 1997-01-07 0 -4 -9 -6 -9 -10 93
## 8 1997-01-08 0 -3 -7 -7 -7 -8 100
## 9 1997-01-09 0 -3 -6 -5 -6 -7 100
## 10 1997-01-10 -3 -4 -5 -4 -5 -6 100
## # ... with 7,185 more rows, and 15 more variables: mean_hum ,
## # min_hum , max_hpa , mean_hpa , min_hpa ,
## # max_visib , mean_visib , min_visib , max_wind ,
## # mean_wind , max_gust , prep , cloud ,
## # events , winddir

Window (AKA analytic) functions

Joins, set operations, that’s pretty cool to have but that's not all. Additionally, a large number of analytic functions are available in dplyr. We have the familiar-from-SQL ranking functions (e.g., dense_rank(), row_number(), ntile(), and cume_dist()):

# 5% hottest days
filter(daily_2016, cume_dist(desc(mean_temp)) <= 0.05) %>% select(day, mean_temp)

## # A tibble: 5 × 2
## day mean_temp
##
## 1 2016-06-24 22
## 2 2016-06-25 22
## 3 2016-07-09 22
## 4 2016-07-11 24
## 5 2016-07-30 22

# 3 coldest days
filter(daily_2016, dense_rank(mean_temp) <= 3) %>% select(day, mean_temp) %>% arrange(mean_temp)

## # A tibble: 4 × 2
## day mean_temp
##
## 1 2016-01-22 -10
## 2 2016-01-19 -8
## 3 2016-01-18 -7
## 4 2016-01-20 -7

We have lead() and lag():

# consecutive days where mean temperature changed by more than 5 degrees:
daily_2016 %>% mutate(yesterday_temp = lag(mean_temp)) %>% filter(abs(yesterday_temp - mean_temp) > 5) %>% select(day, mean_temp, yesterday_temp)

## # A tibble: 6 × 3
## day mean_temp yesterday_temp
##
## 1 2016-02-01 10 4
## 2 2016-02-21 11 3
## 3 2016-06-26 16 22
## 4 2016-07-12 18 24
## 5 2016-08-05 14 21
## 6 2016-08-13 19 13

We also have lots of aggregation functions that, where already provided in base R, come with enhancements in dplyr, such as choosing the column that dictates accumulation order. New in dplyr is, e.g., cummean(), the cumulative mean:

daily_2016 %>% mutate(cum_mean_temp = cummean(mean_temp)) %>% select(day, mean_temp, cum_mean_temp)

## # A tibble: 260 × 3
## day mean_temp cum_mean_temp
##
## 1 2016-01-01 2 2.0000000
## 2 2016-01-02 -1 0.5000000
## 3 2016-01-03 -2 -0.3333333
## 4 2016-01-04 0 -0.2500000
## 5 2016-01-05 2 0.2000000
## 6 2016-01-06 2 0.5000000
## 7 2016-01-07 3 0.8571429
## 8 2016-01-08 4 1.2500000
## 9 2016-01-09 4 1.5555556
## 10 2016-01-10 3 1.7000000
## # ... with 250 more rows

OK. Wrapping up so far, dplyr should make it easy to do data manipulation if you’re used to SQL. So why not just use SQL, what can we do in R that we couldn’t do before?

Visualization

Well, one thing R excels at is visualization. First and foremost, there is ggplot2, Hadley Wickham's famous plotting package, the realization of a "grammar of graphics". ggplot2 predates the tidyverse, but became part of it once the tidyverse came to life. We can use ggplot2 to plot the average monthly temperatures from Berkeley Earth for selected cities and time ranges, like this:

cities = c("Munich", "Bern", "Oslo")
df_cities <- df %>% filter(city %in% cities, year(dt) > 1949, !is.na(avg_temp))
library(ggthemes)  # provides theme_solarized()
(p_1950 <- ggplot(df_cities, aes(dt, avg_temp, color = city)) + geom_point() + xlab("") + ylab("avg monthly temp") + theme_solarized())


While this plot is two-dimensional (with axes time and temperature), a third "dimension" is added via the color aesthetic (aes (..., color = city)).

We can easily reuse the same plot, zooming in on a shorter time frame:

start_time <- as.Date("1992-01-01")
end_time <- as.Date("2013-08-01")
limits <- c(start_time,end_time)
(p_1992 <- p_1950 + (scale_x_date(limits=limits)))


It seems like overall, Bern is warmest, Oslo is coldest, and Munich is in the middle somewhere.
We can add smoothing lines to see this more clearly (by default, confidence intervals would also be displayed, but I’m suppressing them here so as to show the three lines more clearly):

(p_1992 <- p_1992 + geom_smooth(se = FALSE))


Good. Now that we have these lines, can we rely on them to obtain a trend for the temperature? Because that is, ultimately, what we want to find out about.
From here on, we’re zooming in on Munich. Let’s display that trend line for Munich again, this time with the 95% confidence interval added:

p_munich_1992 <- p_munich_1950 + (scale_x_date(limits=limits))
p_munich_1992 + stat_smooth()


Calling stat_smooth() without specifying a smoothing method uses Local Polynomial Regression Fitting (LOESS). However, we could as well use another smoothing method, for example, we could fit a line using lm(). Let’s compare them both:

loess <- p_munich_1992 + stat_smooth(method = "loess", colour = "red") + labs(title = 'loess')
lm <- p_munich_1992 + stat_smooth(method = "lm", color = "green") + labs(title = 'lm')
library(gridExtra)  # provides grid.arrange()
grid.arrange(loess, lm, ncol = 2)


Both fits behave quite differently, especially as regards the shape of the confidence interval near the end (and beginning) of the time range. If we want to form an opinion regarding a possible trend, we will have to do more than just look at the graphs - time to do some time series analysis!
Given this post has become quite long already, we'll continue in the next - so how about next winter? Stay tuned :-)

Note: This post was originally posted here.

Read more…

Shopper Marketing - Infographic

Originally posted on Data Science Central

This infographic on Shopper Marketing was created by Steve Hashman and his team. Steve is Director at Exponential Solutions (The CUBE) Marketing. 

Shopper marketing focuses on the customer in and at the point of purchase. It is an integrated and strategic approach to a customer’s in-store experience which is seen as a driver of both sales and brand equity.

For more information, click here


Read more…

Guest blog post by Rick Riddle

It can be tempting to lump all the people who’ve spent most of their lives with social media into one large group with largely the same interests and aims. Indeed, that is what you will find many marketing firms doing. The internet is rife with articles about how to market to millennials, how Snapchat is the new social media platform of the millennial generation and how Instagram has overtaken Facebook.

To do so, however, would be wrong.

Because if you dig a little deeper, you’ll very quickly find out that millennials are far from the homogeneous group they’ve been made out to be. For example, ‘they’ most certainly don’t all use Snapchat. In fact, an Ipsos study of 1,000 millennials between the ages of 20 and 35 found that more than half don’t have a Snapchat account, 1 in 10 doesn’t have a Facebook account and 40% do not use Instagram.

Figure 1: Infographic by the Smart Paper Help writing service.

In effect, marketing to millennials is a little bit like marketing to women or black people. The category is just far too big, and by using it you lump together people who are entirely different and have entirely different interests.

Broad categories mean less engagement

What’s more, by targeting categories this broad, there is almost no way that a person feels personally addressed by your marketing campaign. In other words, you’re not making use of one of the biggest trends in marketing that we’re currently seeing, and that is the personalization of products and websites.

In order to take advantage of that, you need to slice market segments far more thinly than the word ‘millennial’ ever could. Then you wouldn’t just focus on millennials, or even millennial women; you would focus on, for example, millennial single mothers.

What’s more, this is in many ways far easier to do, as studying the numbers in terms of smaller groups both makes it easier to find out what social media they’re using, as well as to find ways to appeal to those groups directly, by exploring topics that are immediately relevant to them. And that, in turn, will serve to significantly raise their interest and their engagement with your brand.

Modern social media allows for thin-slicing  

And besides, why wouldn’t you approach modern advertising in this way? Many social platforms allow you to thin slice who you approach and how you approach them to an amazing degree. For example, it is possible to target people in specific jobs, in specific areas, even people that work at a specific business.

This is immensely advantageous as it means you can tailor your message exactly for that group – giving them the feeling that you’re talking directly to them and giving them exactly what they might be looking for.

What’s more, by thin-slicing who you address with your advertisements and posts, you’ll manage to tighten all the bolts on the leaky faucet: people who won’t benefit from your ad (because they will not be interested or will not be able to take advantage of what you’re offering) will not be exposed to it. This will make them happier and will mean that you’re spending far less money on people you’re not interested in targeting.

A warning

At the same time, there is a growing body of evidence that people are less and less comfortable with the way that social media have encroached on their privacy. A recent survey by Nation under A-Hack revealed that, when asked, 55% of millennials said they would stay away from social media if they could start afresh, and that 75% were considering closing their accounts if the security breaches continued. And that’s not the only source that shows these kinds of trends, with other infographics about how millennials use social media showing similar findings.

This matters for you in that it is vital that you do not make them feel as if you know too much about them, as – rather than feeling that you’ve personally connected with them – there’s a good chance that this will actually creep them out.

And that can’t be good for your business.

A fine line

In other words, stay away from addressing them directly, letting them know that you’re aware of where they live, or revealing what other information you have about them. And just like with the older generations, it might be about time that we ask for permission before we start broadcasting information about what we know about individuals across our networks.

The truth is, though the US has not yet caught up to the European Union in terms of privacy protection, with the way the mood is currently going, it will sooner or later start swinging in that same direction. When that happens, you want to make certain you’re on the right side of the fence.

So, personalize and thin-slice, but do not become too personal, as people still do not like the idea of businesses peeking into their living rooms.

About the author:

Rick Riddle is a successful blogger whose articles aim to help readers with digital marketing, entrepreneurship, career, and self-development. Feel free to connect with Rick on twitter and LinkedIn.

Read more…

Originally posted on Data Science Central

This infographic came from Medigo. It displays data from the World Health Organization’s “Projections of mortality and causes of death, 2015 and 2030”. The report details all deaths in 2015 by cause and makes predictions for 2030, giving an impression of how global health will develop over the next 14 years. Also featured is data from geoba.se showing how life expectancy will change between now and 2030.

All percentages shown have been calculated relative to projected changes in population growth.

Read original article here


Read more…

Guest blog post by Vimal Natarajan

Introduction


The City and County of San Francisco launched an official open data portal called SF OpenData in 2009 as a product of its official open data program, DataSF. The portal contains hundreds of city datasets for use by developers, analysts, residents and more. Under the category of Public Safety, the portal contains the list of SFPD incidents since Jan 1, 2003.
In this post I have done an exploratory time-series analysis on the crime incidents dataset to see if there are any patterns.

Data

The data for this analysis was downloaded from the publicly available dataset on the City and County of San Francisco’s open data website, SF OpenData. The crime incidents dataset has data recorded from the year 2003 to date. I downloaded the full dataset and performed my analysis for the time period from 2003 to 2015, filtering out the data from the year 2016. There are nearly 1.9 million crime incidents in this dataset.
I have performed minimal data processing on the downloaded raw data to facilitate my analysis.

Analysis

Crimes by Year

The following plot depicts the crimes recorded from the year 2003 till the end of the year 2015.

The horizontal line represents the average number of crimes during those years, which is just below 150,000 crimes per year. As you can observe, from the year 2003 till 2007 the number of crime incidents decreased steadily. But in 2008 and 2009 there was a slight increase in the number of crime incidents. These two years are when the United States went through the financial and subprime mortgage crisis, resulting in what is called the Great Recession. According to the US National Bureau of Economic Research, the recession began around January 2008 and ended around June 2009. As most statisticians say, “Correlation does not imply causation”, so I too want to emphasize that without additional data and insights from its related analysis it may not be possible to relate these two events, but it is nevertheless an interesting observation. Following that period, there was a slight decrease in crime incidents during the next two years, but the number has increased since 2012, ending up above average from 2013 to 2015.
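As a rough sketch of how a chart like this could be built with the tools mentioned at the end of this post (dplyr and ggplot2, plus lubridate for date handling, which is my own choice here) — assuming the raw incidents sit in a data frame called incidents with a Date column already parsed as a date; the names are placeholders for the actual SF OpenData field names:

library(dplyr)
library(ggplot2)
library(lubridate)

# Count incidents per year and plot them with a reference line at the yearly mean
yearly <- incidents %>%
  mutate(year = year(Date)) %>%
  filter(year <= 2015) %>%
  count(year)

ggplot(yearly, aes(year, n)) +
  geom_col() +
  geom_hline(yintercept = mean(yearly$n)) +
  labs(x = "Year", y = "Number of crime incidents")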

Mean crimes by month

The following plot depicts the mean crimes for each month from January till December. You can observe that the mean crime count for each month is more or less around the monthly average, which is just below 12,000 (horizontal line). One interesting observation is that the mean crime count is significantly below the monthly average for the months of February, November and December. The possible reasons could be that February has fewer days than the other months, and that the festive and holiday season falls in November and December.

Mean crimes by day of the month

The following plot depicts the mean crimes for the different days of the month. You can observe that the mean crime count for each day of the month is pretty much around the daily average, which is just below 400 (horizontal line), for the days from the 2nd of the month till the 28th. The mean crime count on the first day of the month is significantly above average. One possible reason could be that the first day of the month is usually pay day. Again, correlation does not imply causation; without additional related data and insights derived from the analysis of that data we cannot be sure. The 29th and 30th are also below average, and the reason could be that the month of February does not have those days. The mean crime count for the 31st of the month is around half of the daily average, and that might be because only about half of the months in a year have a 31st day.

Mean crimes by hour of the day

The following plot depicts the mean crimes by the hour of the day. You can observe that this plot is very different from the other plots, in the sense that the crime incidents are far from the hourly average, which is around 16 (horizontal line). But within this plot you can observe some interesting patterns: crime incidents are well above average around midnight and decline steadily to significantly below the hourly average until early morning, around 5 AM. From the early morning hours starting at 6 AM, crime incidents steadily increase and spike around noon. From noon, they are well above average, peaking around 6 PM in the evening and then declining after 6 PM.
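A similar sketch for the hourly view, under the added assumption that the dataset's Time column holds "HH:MM" strings (again, my own illustration rather than the original code):

library(dplyr)
library(ggplot2)

# Count incidents per date and hour, then average those counts by hour of day
hourly <- incidents %>%
  mutate(hour = as.integer(substr(Time, 1, 2))) %>%
  count(Date, hour) %>%
  group_by(hour) %>%
  summarise(mean_crimes = mean(n))

ggplot(hourly, aes(hour, mean_crimes)) +
  geom_line() +
  geom_hline(yintercept = mean(hourly$mean_crimes)) +
  labs(x = "Hour of day", y = "Mean number of crime incidents")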

Mean crimes by day of the week

The following plot depicts the mean crimes by the day of the week. As you can observe, Sunday has the fewest crime incidents, well below the daily average, which is just below 400 (vertical line), and Friday has the most crime incidents, well above the daily average.

Mean crimes during holidays

The following plot depicts the mean crimes during a few key days, such as holidays in the United States. You can observe here that the number of crime incidents is significantly high during the New Year, well above the daily average, which is just below 400 (horizontal line). During the other holidays the number of crime incidents is more or less the same as the daily average, but during Christmas Eve and Christmas Day the number of crime incidents is significantly lower than the daily average. Since Thanksgiving Day falls on different dates each year, as an approximation I chose the date of November 24 here. I was expecting to see significantly lower crime incidents during this period, but that does not seem to be the case.

Conclusion

In conclusion, based on the above observation, we can see some patterns in the crime incidents and arrive at the following conclusions:

  • The average number of crime incidents happening daily in the City and County of San Francisco is around 400.
  • The number of crime incidents is highest around midnight and lowest at the early morning hours.
  • The number of crime incidents is usually lower during Christmas.
  • The number of crime incidents has been slowly increasing in the recent years.
  • The number of crime incidents is high during New Year day and at the beginning of every month.

The above is just a high-level exploratory time-series analysis. With further in-depth analysis it is possible to arrive at more insights. In my future posts I will try to perform those analyses.

Technology

This analysis was performed entirely using RStudio version 0.99 and R Version 3.2.0.
The data processing and plots were done using the R libraries ggplot2 and dplyr.

Read more…

Guest blog post by Dante Munnis

Digging through messy data and doing numerous calculations just so you can submit a report or arrive at the result of your quarterly business development can sometimes be nigh impossible. After all, we are only human, and by the time we get to the other side of our spreadsheet equation, we have lost all sense of what we were trying to accomplish.

Luckily, there are data visualization and analysis tools out there that can do most of the heavy lifting for us. Remember, you will still need to do some of the work yourself, but putting it all together will become that much simpler. Let’s take a look at some of the best data analysis tools at our disposal.

1. Open Refine

At first, you might be surprised by how much Open Refine resembles Google’s own Spreadsheets. This is because it started as a Google project but quickly became crowd-sourced and independent. In practice, this means that Open Refine has all the built-in algorithms and formulas that you might need for your business data analysis.

Keep in mind that while it does resemble Spreadsheets, it doesn’t have all the regular features you would expect, such as manual cell manipulation and custom algorithms. You would need to export your data and bring it back in. If that doesn’t cause too much of a headache, you might want to give Open Refine a shot.

2. Data Wrangler

Stanford University’s own data analysis tool is open to public use. While text manipulation and a web-based interface are certainly a plus, you should consider the other factors as well. Some of the formulas provided by default don’t work very well with large amounts of data, often giving false results or downright crashing the tool. While easy and accessible to use, Data Wrangler might not be a good tool for internal and sensitive data, since all of the data is stored at Stanford for research purposes.

3. Rapid Miner

As one of the best data visualization tools out there, Rapid Miner had to find its way onto our list. It can not only manipulate custom data and run calculations and analyses, but also model and visualize the results. This award-winning tool is known to provide great results no matter what data you are trying to analyze.

The near-perfect visualization system is just an added bonus considering everything you are getting. If you need a tool that can help you lead and develop projects with coworkers who are less than adept at analysis, Rapid Miner is the perfect tool for the job.

4. Wolfram Alpha

Ever wonder what it would feel like to have a personal computing assistant? Wolfram Alpha is exactly that kind of platform. Think of Google Search, but for business analytics and data research. Whatever your field of work and specific needs, you can be sure that Wolfram will make sense of them and help you decode any problems you might be experiencing.

5. Solver

Sometimes you don't need external apps or web services for your data analysis. Solver is one such addition to your Excel spreadsheets. Offering a wide variety of optimization and programming algorithms, Solver will help you make sense of your data much faster than you otherwise would. It's light, fast and easy to use, so there's no reason not to give it a shot. Keep in mind that Solver won't be able to handle more complex and demanding analysis tasks, so make sure you use it in a smart way.

6. Google Fusion Tables

While it may not be the most versatile or complex tool on the web, Fusion Tables is one of the most accessible data visualization services out there. The best thing about it is that it is free to use and very approachable, so there's no need to spend hours on end learning what's what. You can visualize your data in any shape or form you desire. Just keep in mind that this tool is suited to simple calculations, not vast, sprawling data analysis tasks.

7. Zoho Reports

You might have heard about Zoho, since it’s one of the most popular business data analysis tools on the web. It’s fairly easy to get into and use, requiring only a simple log-in and data input. Use Zoho to quickly and professionally turn your data into charts, tables and pivots in order to use them for further research.

8. NodeXL

Taking the best from both worlds, NodeXL is simple to use and fairly advanced in its algorithmic possibilities. You can not only analyze and visualize raw data, but also use it to build and visualize networks and the relations between different results. While some of the features might be too advanced for everyday data analysis, NodeXL is the perfect tool for more complex tasks.

9. Google Chart Tools

Another Google tool on our list that provides visualization and analysis but doesn't focus on raw data. Instead, you can point the tool at different sources on the web and bring them together in visualized charts, analyzing external data to get the results you need. While it's very useful and provides accurate results, Google Chart Tools isn't very user friendly, requiring a bit of programming knowledge to fully utilize its capabilities.

10. Time Flow

Data analysis sometimes requires a different kind of visualization. Time Flow is a tool that can analyze and visualize time points and create a data map that provides a clear picture of how and when your specific data developed. While it does sound complex, the tool itself is fairly easy to use and allows a plethora of customization options. Use Time Flow whenever you need to create timelines and streamline your data.

About the author: Dante Munnis is a media and marketing expert currently working at Essay Republic. He shares ideas and experience on how to build your brand and attract more customers.

Read more…

Data Types and Roles in Tableau

Originally posted on Analytic Bridge

In Tableau, there are several data types that are supported. For example, you may have text values, date values, numerical values, and more. Each of the data types can take on different roles that dictate their behavior in the view.


Data Types

All fields in a data source have a data type. The data type reflects the kind of information stored in that field, for example integers (410), dates (1/23/2005) and strings (“Wisconsin”).

Mixed Data Types for Excel and CSV Files

Most columns in an Excel or CSV (comma separated value) file contain values of the same data type (dates, numbers, text). When you connect to the file, Tableau creates a field in the appropriate area of the Data window for each column. Dates and text values are dimensions, and numbers are measures.
However, a column might have a mixture of data types such as numbers and text, or numbers and dates. When you connect to the file, the mixed-value column is mapped to a field with a single data type in Tableau. Therefore, a column that contains numbers and dates might be mapped as a measure or it might be mapped as a date dimension. The mapping is determined by the data types of the first 16 rows in the data source.

Example: if most of the first 16 rows are text values, then the entire column is mapped as text.

Empty cells also create mixed-value columns because their formatting is different from text, dates, or numbers.
Depending on the data type Tableau determines for each field, the field might contain Null values for the other (non-matching) records.
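
Illustrative example (not taken from the Tableau documentation): suppose the first 16 rows of a column contain numbers such as 410, 215 and 330, and a later row contains the text "N/A". Because most of the first 16 rows are numeric, the whole column is mapped as a number field, and the "N/A" record appears as Null in that field.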

You can read more about this at the Tableau Tutorial

Read more…

Originally posted on Data Science Central

Things, not Strings
Entity-centric views on enterprise information and all kinds of data sources provide a means to get a more meaningful picture of all sorts of business objects. This method of information processing is as relevant to customers, citizens, or patients as it is to knowledge workers like lawyers, doctors, or researchers. People do not actually search for documents; they search for facts and other chunks of information that can be bundled together to answer concrete questions.

Strings, or names for things, are not the same as the things they refer to. Still, these two aspects of an entity get mixed up regularly, nurturing a Babylonian language confusion. Any search term can refer to different things, which is why Google has rolled out its own knowledge graph to help organize information on the web at a large scale.

Semantic graphs can build the backbone of any information architecture, not only on the web. They can enable entity-centric views also on enterprise information and data. Such graphs of things contain information about business objects (such as products, suppliers, employees, locations, research topics, …), their different names, and relations to each other. Information about entities can be found in structured (relational databases), semi-structured (XML), and unstructured (text) data objects. Nevertheless, people are not interested in containers but in entities themselves, so they need to be extracted and organized in a reasonable way.

Machines and algorithms make use of semantic graphs to retrieve not only the objects themselves but also the relations that exist between business objects, even if they are not explicitly stated. As a result, ‘knowledge lenses’ are delivered that help users better understand the underlying meaning of business objects when put into a specific context.

Personalization of information
The ability to view entities or business objects in different ways when put into various contexts is key for many knowledge workers. For example, drugs have regulatory aspects, a therapeutic character, and yet another meaning to product managers or sales people. One benefits quickly when confronted only with those aspects of an entity that are really relevant in a given situation. This kind of personalized information processing demands a semantic layer on top of the data layer, especially when information is stored in various forms and scattered across different repositories.

Understanding and modelling the meaning of content assets and of interest profiles of users are based on the very same methodology. In both cases, semantic graphs are used, and also the linking of various types of business objects works the same way.

Recommender engines based on semantic graphs can link similar contents or documents that are related to each other in a highly precise manner. The same algorithms help to link users to content assets or products. This approach is the basis for ‘push-services’ that try to ‘understand’ users’ needs in a highly sophisticated way.

‘Not only MetaData’ Architecture
Together with the data and content layer and its corresponding metadata, this approach unfolds into a four-layered information architecture as depicted here.

Following the NoSQL paradigm, which is about ‘Not only SQL’, one could call this content architecture ‘Not only Metadata’, thus ‘NoMeDa’ architecture. It stresses the importance of the semantic layer on top of all kinds of data. Semantics is no longer buried in data silos but rather linked to the metadata of the underlying data assets. Therefore it helps to ‘harmonize’ different metadata schemes and various vocabularies. It makes the semantics of metadata, and of data in general, explicitly available. While metadata most often is stored per data source, and therefore not linked to each other, the semantic layer is no longer embedded in databases. It reflects the common sense of a certain domain and through its graph-like structure it can serve directly to fulfill several complex tasks in information management:

  • Knowledge discovery, search and analytics
  • Information and data linking
  • Recommendation and personalization of information
  • Data visualization

Graph-based Data Modelling
Graph-based semantic models resemble the way human beings tend to construct their own models of the world. Any person, not only subject matter experts, organizes information by at least the following six principles:

  1. Draw a distinction between all kinds of things: ‘This thing is not that thing’
  2. Give things names: ‘This thing is my dog Goofy’ (some might call it Dippy Dawg, but it’s still the same thing)
  3. Categorize things: ‘This thing is a dog but not a cat’
  4. Create general facts and relate categories to each other: ‘Dogs don’t like cats’
  5. Create specific facts and relate things to each other: ‘Goofy is a friend of Donald’, ‘Donald is the uncle of Huey, Dewey, and Louie’, etc.
  6. Use various languages for this; e.g., the above-mentioned fact in German is ‘Donald ist der Onkel von Tick, Trick und Track’ (remember: the thing called ‘Huey’ is the same thing as the thing called ‘Tick’; it is just that the name or label for this thing differs between languages).
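
As a minimal illustration of this idea in R (the language used elsewhere on this blog), the facts above can be written down as subject-predicate-object triples. The data frame below is a toy stand-in for a real RDF graph, and the names are taken from the examples above rather than from any actual PoolParty model.

## SKETCH: FACTS AS SUBJECT-PREDICATE-OBJECT TRIPLES ####
# toy triple store; a real knowledge graph would use RDF, not a data frame
triples = data.frame(
  subject   = c("Goofy", "Goofy",      "Dog",      "Goofy",    "Donald"),
  predicate = c("type",  "label",      "dislikes", "friendOf", "uncleOf"),
  object    = c("Dog",   "Dippy Dawg", "Cat",      "Donald",   "Huey"),
  stringsAsFactors = FALSE
)

# "tell me what is there" about Goofy
triples[triples$subject == "Goofy", ]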

These fundamental principles for the organization of information are well reflected by semantic knowledge graphs. The same information could be stored as XML, or in a relational database, but it’s more efficient to use graph databases instead for the following reasons:

  • The way people think fits well with information that is modelled and stored when using graphs; little or no translation is necessary.
  • Graphs serve as a universal meta-language to link information from structured and unstructured data.
  • Graphs open up doors to a better aligned data management throughout larger organizations.
  • Graph-based semantic models can also be understood by subject matter experts, who are actually the experts in a certain domain.
  • The search capabilities provided by graphs let you find out unknown linkages or even non-obvious patterns to give you new insights into your data.
  • For semantic graph databases, there is a standardized query language called SPARQL that allows you to explore data.
  • In contrast to traditional ways of querying databases, where knowledge of the database schema and content is necessary, SPARQL allows you to simply ask “tell me what is there”.

Standards-based Semantics
Making the semantics of data and metadata explicit is even more powerful when based on standards. A framework for this purpose has evolved over the past 15 years at W3C, the World Wide Web Consortium. Initially designed to be used on the World Wide Web, many enterprises have been adopting this stack of standards for Enterprise Information Management. They now benefit from being able to integrate and link data from internal and external sources with relatively low costs.

At the base of all those standards, the Resource Description Framework (RDF) serves as a ‘lingua franca’ to express all kinds of facts that can involve virtually any kind of category or entity, and also all kinds of relations. RDF can be used to describe the semantics of unstructured text, XML documents, or even relational databases. The Simple Knowledge Organization System (SKOS) is based on RDF. SKOS is widely used to describe taxonomies and other types of controlled vocabularies. SPARQL can be used to traverse and make queries over graphs based on RDF or standard schemes like SKOS.

With SPARQL, far more complex queries can be executed than with most other database query languages. For instance, hierarchies can be traversed and aggregated recursively: a geographical taxonomy can then be used to find all documents containing places in a certain region although the region itself is not mentioned explicitly.
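
To give a flavour of what such a recursive traversal does, here is a small sketch in plain R rather than SPARQL. The place names, document names and table layout are invented for illustration; a real implementation would query an RDF store with a SPARQL property path instead.

## SKETCH: FINDING DOCUMENTS VIA A GEOGRAPHICAL TAXONOMY ####
# toy taxonomy: each row says that 'narrow' lies within 'broad'
broader  = data.frame(narrow = c("Vienna", "Graz", "Austria"),
                      broad  = c("Austria", "Austria", "Europe"),
                      stringsAsFactors = FALSE)
# toy annotations: which place each document mentions
mentions = data.frame(doc   = c("doc1", "doc2", "doc3"),
                      place = c("Vienna", "Graz", "Berlin"),
                      stringsAsFactors = FALSE)

# recursively collect every place that lies within a given region
places_in = function(region) {
  children = broader$narrow[broader$broad == region]
  c(children, unlist(lapply(children, places_in)))
}

# documents about Austria, although "Austria" itself is never mentioned in them
mentions$doc[mentions$place %in% places_in("Austria")]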

Standards-based semantics also helps to make use of already existing knowledge graphs. Many government organisations have made high-quality taxonomies and semantic graphs available using semantic web standards. These can easily be picked up and extended with your own data and domain-specific knowledge.

Semantic Knowledge Graphs will grow with your needs!
Standards-based semantics provides yet another advantage: it is becoming increasingly easy to hire skilled people who have worked with standards like RDF, SKOS or SPARQL before. Even so, experienced knowledge engineers and data scientists are a comparatively rare species, so it is crucial to grow graphs and modelling skills over time. Starting with SKOS and gradually extending an enterprise knowledge graph by introducing more schemes and by mapping to other vocabularies and datasets is a well-established, agile procedure model.

A graph-based semantic layer in enterprises can be expanded step by step, just like any other network. Analogous to a street network: start with the main roads, introduce more and more connecting roads, and classify streets, places, and intersections with an increasingly refined classification system. It all comes down to an evolving semantic graph that serves more and more as a map of your data, content and knowledge assets.

Semantic Knowledge Graphs and your Content Architecture
It’s a matter of fact that semantics serves as a kind of glue between unstructured and structured information and as a foundation layer for data integration efforts. But even for enterprises dealing mainly with documents and text-based assets, semantic knowledge graphs will do a great job.

Semantic graphs extend the functionality of a traditional search index. They don’t simply annotate documents and store occurrences of terms and phrases; they introduce concept-based indexing, in contrast to term-based approaches. Remember: semantics helps to identify the things behind the strings. The same applies to concept-based search over content repositories: documents get linked to the semantic layer, and therefore the knowledge graph can be used not only for typical retrieval but also to classify, aggregate, filter, and traverse the content of documents.

PoolParty combines Machine Learning with Human Intelligence
Semantic knowledge graphs have the potential to innovate data and information management in any organisation. Besides questions around integrability, it is crucial to develop strategies to create and sustain the semantic layer efficiently.

Looking at the broad spectrum of semantic technologies that can be used for this endeavour, they range from manual to fully automated approaches. The promise of deriving high-quality semantic graphs from documents fully automatically has not been fulfilled to date. On the other hand, handcrafted semantics is error-prone, incomplete, and too expensive. The best solution often lies in a combination of different approaches. PoolParty combines Machine Learning with Human Intelligence: extensive corpus analysis and corpus learning support taxonomists, knowledge engineers and subject matter experts with the maintenance and quality assurance of semantic knowledge graphs and controlled vocabularies. As a result, enterprise knowledge graphs are more complete, up to date, and consistently used.

“An Enterprise without a Semantic Layer is like a Country without a Map.”

 

Read more…

Analyze NYC housing market through Airbnb

Originally posted on Data Science Central

Contributed by Amy (Yujing) Ma. She took the NYC Data Science Academy 12-week full-time Data Science Bootcamp program between Jan 11th and Apr 1st, 2016. This post is based on her second class project (due in the 4th week of the program).

 

Why Airbnb?

Visiting NYC? Airbnb is a good choice for booking unique accommodations. I have used Airbnb.com for almost 3 years; the site lets me spend my vacations like a local and has given me some fantastic experiences!

To better explore its rental listings across New York City, I designed this app to answer some questions: How many of the listings are for an entire home versus a room in an apartment? How many are controlled by the same host? Why is the tax issue serious for Airbnb in NYC? Should you think twice before trusting a review?

Click here to play: http://216.230.228.88:3838/bootcamp004_project/Project2-Shiny/amyma/

Data Source

Data source: Inside Airbnb

Description: The original dataset was compiled on 09/01/2015 and contains two tables: 1. Listings (35,957 locations and 24,426 hosts); 2. Reviews (366,453 reviews).

What can we learn from the app?

How many of the listings are for an entire home versus a room in an apartment?

There are 35,957 locations and 53.6% of them are entire homes/apartments. Most of the listings are in Manhattan.

How many are controlled by the same host?

Almost 20% of the listings are controlled by 7% of the hosts.

Why is the tax issue serious for Airbnb in NYC?

In the second part of the Listings, Neighborhoods, and Hosts tab, the app shows you the top n super hosts in NYC. The table shows that most of the super hosts are not local, and some of them are not even individuals.

For instance, Flatbook is a company that combines the hotel model with Airbnb. That is why the tax issue is so serious for Airbnb in NYC: people are not just sharing their places, they are renting out entire homes/apartments as a business.

Should you think twice before trusting a review? 

Yes, be careful with reviews that contain words like great, nice and recommend! Based on the word cloud of reviews, people use great as frequently as stay!

Why do people tend to leave positive reviews?

There are three possible reasons: 1. It's awkward to leave a negative one. 2. They are afraid the host would leave a negative review too. 3. Most Airbnb experiences are really enjoyable.

Any suggestions for Airbnb and hosts?

1. To Airbnb, I would suggest sending review reminders on Saturday, Sunday, and Monday, and not sending any emails during the rest of the week. Based on the plot, people tend to leave reviews after holidays (such as 01/02, marked as A) and on weekends (marked as B).

This plot shows that people are more likely to leave reviews on Mondays and Sundays, and less likely to leave one on Wednesday or Thursday, when they are busy at work.
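
As a rough sketch of how such a day-of-week count could be produced in R (the language the app is built with), assuming a hypothetical reviews data frame with a date column as in the Inside Airbnb export:

## SKETCH: REVIEWS BY DAY OF WEEK ####
library(dplyr)
library(ggplot2)

# 'reviews' and its 'date' column are assumed names for illustration
reviews %>%
  mutate(weekday = weekdays(as.Date(date))) %>%
  count(weekday) %>%
  ggplot(aes(x = weekday, y = n)) +
  geom_bar(stat = "identity") +
  labs(x = "Day of week", y = "Number of reviews")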

2. To hosts, I would suggest choosing long-term rentals in February and November, since the fewest people book during those months.

And, based on the word cloud of reviews, I would suggest they highlight their location with keywords such as subway, clean, restaurant and neighborhood.

Beyond my questions, the app can answer much more. To explore your own questions about Airbnb in NYC, you can play with my app.

How to play?

Airbnb Listings in NYC 

To give you basic information about Airbnb listings in NYC, the first tab maps all of the listings.

Every circle on the map indicates one listing and different colors indicate different room types (red: Entire Home/Apt, blue: Private Room, green: Shared Room).

To get basic information on the average price in a particular neighborhood (for example, Manhattan), choose "Manhattan" on the left panel.

The right panel and map will change based on your input. In this example, I’ve selected “Manhattan".

Listings, Neighborhoods, and Hosts

The next tab shows some details about the Airbnb Listings:

To change the graphs, you can simply click the button "Make a change?"

Review by Time

The third and fourth tabs help you explore the reviews related to Airbnb listings in NYC.

To help users better see patterns and trends, for example in the Number of Reviews Over Time plot, there are two methods:

Change the number in the text box at the bottom-left of the plot, which averages the number of reviews over the specified number of days. For instance, changing the number to 7 shows a smoother trend:

Go directly to the bottom plot and show the data for a specified time period.

Word Cloud-- Reviews
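
The word cloud itself is shown in the app. As a rough idea of how such a cloud can be built from raw review text, here is a minimal sketch using the wordcloud package; the reviews$comments column name is an assumption for illustration, not the app's actual code.

## SKETCH: WORD CLOUD FROM REVIEW TEXT ####
library(wordcloud)

# split the review text into lowercase words and count their frequencies
words = unlist(strsplit(tolower(reviews$comments), "[^a-z]+"))
freq  = sort(table(words), decreasing = TRUE)
freq  = freq[nchar(names(freq)) > 3]   # drop very short words

wordcloud(names(freq), as.numeric(freq), max.words = 100)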

Thanks for reading

Thanks for reading; I hope you found this post and my app interesting. I want to improve this app to provide more information, so if you have any suggestions please feel free to leave a comment or contact me directly.

Read more…

Originally posted on Data Science Central

Original post published to DataScience+

In this post I will show how to collect data from a webpage and analyze or visualize it in R. For this task I will use the rvest package and will get the data from Wikipedia. I got the idea to write this post from Fisseha Berhane.
I will get the prevalence of obesity in the United States from a Wikipedia page and then plot it on a map. Let's begin by loading the required packages.

## LOAD THE PACKAGES ####
library(rvest)
library(ggplot2)
library(dplyr)
library(scales)


Download the data from Wikipedia.

## LOAD THE DATA ####
obesity = read_html("https://en.wikipedia.org/wiki/Obesity_in_the_United_States")
obesity = obesity %>%
  html_nodes("table") %>%
  .[[1]] %>%
  html_table(fill = T)


The first line of code fetches the page from Wikipedia, and the piped chain that follows extracts the table we are interested in and converts it into a data frame in R.
Here is the head of our data:

head(obesity)
State and District of Columbia Obese adults Overweight (incl. obese) adults
1 Alabama 30.1% 65.4%
2 Alaska 27.3% 64.5%
3 Arizona 23.3% 59.5%
4 Arkansas 28.1% 64.7%
5 California 23.1% 59.4%
6 Colorado 21.0% 55.0%
Obese children and adolescents Obesity rank
1 16.7% 3
2 11.1% 14
3 12.2% 40
4 16.4% 9
5 13.2% 41
6 9.9% 51


The data frame looks good; now we need to clean it to make it ready to plot.

## CLEAN THE DATA ####
str(obesity)
'data.frame': 51 obs. of 5 variables:
$ State and District of Columbia : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
$ Obese adults : chr "30.1%" "27.3%" "23.3%" "28.1%" ...
$ Overweight (incl. obese) adults: chr "65.4%" "64.5%" "59.5%" "64.7%" ...
$ Obese children and adolescents : chr "16.7%" "11.1%" "12.2%" "16.4%" ...
$ Obesity rank : int 3 14 40 9 41 51 49 43 22 39 ...

# remove the % and make the data numeric
for(i in 2:4){
  obesity[,i] = gsub("%", "", obesity[,i])
  obesity[,i] = as.numeric(obesity[,i])
}
# check data again
str(obesity)
'data.frame': 51 obs. of 5 variables:
$ State and District of Columbia : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
$ Obese adults : num 30.1 27.3 23.3 28.1 23.1 21 20.8 22.1 25.9 23.3 ...
$ Overweight (incl. obese) adults: num 65.4 64.5 59.5 64.7 59.4 55 58.7 55 63.9 60.8 ...
$ Obese children and adolescents : num 16.7 11.1 12.2 16.4 13.2 9.9 12.3 14.8 22.8 14.4 ...
$ Obesity rank : int 3 14 40 9 41 51 49 43 22 39 ...


Fix the names of variables by removing the spaces.

names(obesity)
[1] "State and District of Columbia" "Obese adults"
[3] "Overweight (incl. obese) adults" "Obese children and adolescents"
[5] "Obesity rank"

names(obesity) = make.names(names(obesity))
names(obesity)
[1] "State.and.District.of.Columbia" "Obese.adults"
[3] "Overweight..incl..obese..adults" "Obese.children.and.adolescents"
[5] "Obesity.rank"


Now, it's time to load the map data.

# load the map data
states = map_data("state")
str(states)
'data.frame': 15537 obs. of 6 variables:
$ long : num -87.5 -87.5 -87.5 -87.5 -87.6 ...
$ lat : num 30.4 30.4 30.4 30.3 30.3 ...
$ group : num 1 1 1 1 1 1 1 1 1 1 ...
$ order : int 1 2 3 4 5 6 7 8 9 10 ...
$ region : chr "alabama" "alabama" "alabama" "alabama" ...
$ subregion: chr NA NA NA NA ...


We will merge the two datasets (obesity and states) by region, so we first need to create a new variable (region) in the obesity dataset.

# create a new variable name for state
obesity$region = tolower(obesity$State.and.District.of.Columbia)


Merge the datasets.

states = merge(states, obesity, by="region", all.x=T)
str(states)
'data.frame': 15537 obs. of 11 variables:
$ region : chr "alabama" "alabama" "alabama" "alabama" ...
$ long : num -87.5 -87.5 -87.5 -87.5 -87.6 ...
$ lat : num 30.4 30.4 30.4 30.3 30.3 ...
$ group : num 1 1 1 1 1 1 1 1 1 1 ...
$ order : int 1 2 3 4 5 6 7 8 9 10 ...
$ subregion : chr NA NA NA NA ...
$ State.and.District.of.Columbia : chr "Alabama" "Alabama" "Alabama" "Alabama" ...
$ Obese.adults : num 30.1 30.1 30.1 30.1 30.1 30.1 30.1 30.1 30.1 30.1 ...
$ Overweight..incl..obese..adults: num 65.4 65.4 65.4 65.4 65.4 65.4 65.4 65.4 65.4 65.4 ...
$ Obese.children.and.adolescents : num 16.7 16.7 16.7 16.7 16.7 16.7 16.7 16.7 16.7 16.7 ...
$ Obesity.rank : int 3 3 3 3 3 3 3 3 3 3 ...

Plot the data


Finally we will plot the prevalence of obesity in adults.

## MAKE THE PLOT ####
# adults
ggplot(states, aes(x = long, y = lat, group = group, fill = Obese.adults)) +
  geom_polygon(color = "white") +
  scale_fill_gradient(name = "Percent", low = "#feceda", high = "#c81f49", guide = "colorbar", na.value="black", breaks = pretty_breaks(n = 5)) +
  labs(title="Prevalence of Obesity in Adults") +
  coord_map()


Here is the plot in adults:
adults
Similarly, we can plot the prevalence of obesity in children.

# children
ggplot(states, aes(x = long, y = lat, group = group, fill = Obese.children.and.adolescents)) +
  geom_polygon(color = "white") +
  scale_fill_gradient(name = "Percent", low = "#feceda", high = "#c81f49", guide = "colorbar", na.value="black", breaks = pretty_breaks(n = 5)) +
  labs(title="Prevalence of Obesity in Children") +
  coord_map()


Here is the plot in children:
children
If you would like to show the name of each state on the map, use the code below to create a new dataset.

statenames = states %>%
  group_by(region) %>%
  summarise(
    long = mean(range(long)),
    lat = mean(range(lat)),
    group = mean(group),
    Obese.adults = mean(Obese.adults),
    Obese.children.and.adolescents = mean(Obese.children.and.adolescents)
  )


Then add this layer to the ggplot code above:

geom_text(data=statenames, aes(x = long, y = lat, label = region), size=3)


That's all. I hope you learned something useful today.

Read more…

Guest blog post by Vimal Natarajan

Introduction

The City and County of San Francisco had launched an official open data portal called SF OpenData in 2009 as a product of its official open data program, DataSF. The portal contains hundreds of city datasets for use by developers, analysts, residents and more. Under the category of Public Safety, the portal contains the list of SFPD Incidents since Jan 1, 2003.
In my previous post I performed an exploratory time-series analysis on the crime incidents data to identify any patterns.
In this post I have performed an exploratory geo analysis on the crime incidents data to identify any patterns based on the San Francisco Police Department District classification.

Data

The data for this analysis has been downloaded from the publicly available data on the City and County of San Francisco’s open data website, SF OpenData. The crime incidents database has data recorded from the year 2003 to date. I downloaded the full data and performed my analysis for the time period from 2003 to 2015, filtering out the data from the year 2016. There are nearly 1.9 million crime incidents in this dataset.
I have performed minimal data processing on the downloaded raw data to facilitate my analysis.


SFPD Police Districts

There are 10 police districts in the City and County of San Francisco. I have categorized my analysis based on these Police Districts.
•    Bayview
•    Central
•    Ingleside
•    Mission
•    Northern
•    Park
•    Richmond
•    Southern
•    Taraval
•    Tenderloin

Analysis

Crimes in Police District over the Years

The following plot depicts the number of crimes recorded from the year 2003 till the end of the year 2015 and categorized by the SFPD Police Districts.

By analyzing the plot above, we can arrive at the following insights:

  • Southern district has the highest number of crimes over the years, and the number has been increasing in the last few years. In addition, Central district also saw a sharp increase in the number of crimes in the last few years. Northern district, despite a steady decline in the number of crimes from 2003 to 2010, has had a sharp increase since 2011.
  • Park and Richmond districts have had the lowest number of crimes over these years.
  • Only Mission and Tenderloin have seen a steady decline in the number of crimes in recent years.


Crimes in Police District by Hour of the Day

The following plot depicts the number of crimes recorded by the hour and categorized by the SFPD Police Districts.

By analyzing the plot above, we can arrive at the following insights:

  • The number of crimes declines steadily after midnight, is at its lowest during the early morning hours, and then starts increasing and peaks around 6 PM in the evening. This is the same insight we arrived at in my previous analysis; here the data is categorized by police district and the same pattern still holds.
  • As seen in the previous plot, Park and Richmond districts have the lowest number of crimes throughout the day.
  • As highlighted in red in the plot above, the maximum number of crimes happens in Southern district around 6 PM in the evening.
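
A plot like the one described above can be put together with a few lines of the dplyr and ggplot2 stack listed in the Technology section below. The sketch assumes a data frame called incidents with district and time columns named as in the SFPD export (PdDistrict and a Time column in "HH:MM" format); these names are assumptions for illustration, not the author's actual code.

## SKETCH: INCIDENTS BY DISTRICT AND HOUR ####
library(dplyr)
library(ggplot2)

# count incidents per district and hour of the day
by_hour = incidents %>%
  mutate(hour = as.integer(substr(Time, 1, 2))) %>%
  count(PdDistrict, hour)

ggplot(by_hour, aes(x = hour, y = n)) +
  geom_line() +
  facet_wrap(~ PdDistrict) +
  labs(x = "Hour of day", y = "Number of incidents")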


Crimes in Police District by Day of Week

The following plot depicts the number of crimes recorded during different days of the week and categorized by the SFPD Police Districts.

By analyzing the plot above, we can arrive at the following insights:

  • In general, fewer crimes happen during the weekends than on weekdays across all districts. The only exception here is Central, where more crimes happen during the weekend, particularly on Saturdays. One possible reason could be that there are more people around the Pier on Saturdays.
  • Another observation is that crimes usually peak on Fridays across all districts, with the exception of Tenderloin, where Wednesday seems to have the most crimes.
  • As highlighted in red in the plot above, Fridays in Southern district have the maximum number of crimes. Taking into account the analysis from the previous plot, Friday around 6 PM in Southern district appears to be the most dangerous time with regard to the number of crimes.

Technology

This analysis was performed entirely using RStudio version 0.99 and R Version 3.2.0.
The data processing and plots were done using the R libraries ggplot2 and dplyr.

Read more…

Guest blog post by Klodian

Original post is published at DataScience+

Recently, I became interested in grabbing data from webpages, such as Wikipedia, and visualizing it with R. As I did in my previous post, I use the rvest package to get the data from the webpage and the ggplot2 package to visualize the data.
In this post, I will map life expectancy for White and African-American populations in the US.
Load the required packages.

## LOAD THE PACKAGES ####
library(rvest)
library(ggplot2)
library(dplyr)
library(scales)

Import the data from Wikipedia.

## LOAD THE DATA ####
le = read_html("https://en.wikipedia.org/wiki/List_of_U.S._states_by_life_expectancy")
le = le %>%
  html_nodes("table") %>%
  .[[2]] %>%
  html_table(fill=T)

Now I have to clean the data. The comments below explain the role of each line of code.

## CLEAN THE DATA ####
# select only columns with data
le = le[c(1:8)]
# get the names from 3rd row and add to columns
names(le) = le[3,]
# delete the rows and columns which I am not interested in
le = le[-c(1:3), ]
le = le[, -c(5:7)]
# rename the 4th and 5th columns
names(le)[c(4,5)] = c("le_black", "le_white")
# make variables as numeric
le = le %>%
  mutate(
    le_black = as.numeric(le_black),
    le_white = as.numeric(le_white))

Since there are some differences in life expectancy between White and African-American populations, I will calculate the difference and map it.

le = le %>% mutate(le_diff = (le_white - le_black))

I will load the map data and merge the datasets together.

## LOAD THE MAP DATA ####
states = map_data("state")
# create a new variable name for state
le$region = tolower(le$State)
# merge the datasets
states = merge(states, le, by="region", all.x=T)

Now it's time to make the plot. First I will plot the life expectancy of African-Americans in the US. For a few states we don't have the data, so I will color them grey.

## MAKE THE PLOT ####
# Life expectancy in African American
ggplot(states, aes(x = long, y = lat, group = group, fill = le_black)) +
  geom_polygon(color = "white") +
  scale_fill_gradient(name = "Years", low = "#ffe8ee", high = "#c81f49", guide = "colorbar", na.value="#eeeeee", breaks = pretty_breaks(n = 5)) +
  labs(title="Life expectancy in African American") +
  coord_map()

Here is the plot:
Le_african_american

The code below is for White people in US.

# Life expectancy in White American
ggplot(states, aes(x = long, y = lat, group = group, fill = le_white)) +
  geom_polygon(color = "white") +
  scale_fill_gradient(name = "Years", low = "#ffe8ee", high = "#c81f49", guide = "colorbar", na.value="Gray", breaks = pretty_breaks(n = 5)) +
  labs(title="Life expectancy in White") +
  coord_map()

Here is the plot:
Le_white

Finally, I will map the differences between white and African American people in US.

# Differences in Life expectancy between White and African American
ggplot(states, aes(x = long, y = lat, group = group, fill = le_diff)) +
  geom_polygon(color = "white") +
  scale_fill_gradient(name = "Years", low = "#ffe8ee", high = "#c81f49", guide = "colorbar", na.value="#eeeeee", breaks = pretty_breaks(n = 5)) +
  labs(title="Differences in Life Expectancy between \nWhite and African Americans by States in US") +
  coord_map()

Here is the plot:
Le_differences
On my previous post I got a comment asking to add a pop-up effect when hovering over the states. This is a simple task, as Andrea explained in his comment. What you have to do is install the plotly package, create a ggplot object, and then use the function ggplotly(map_plot) to plot it.

library(plotly)
map_plot = ggplot(states, aes(x = long, y = lat, group = group, fill = le_black)) +
  geom_polygon(color = "white") +
  scale_fill_gradient(name = "Years", low = "#ffe8ee", high = "#c81f49", guide = "colorbar", na.value="#eeeeee", breaks = pretty_breaks(n = 5)) +
  labs(title="Life expectancy in African American") +
  coord_map()
ggplotly(map_plot)

Here is the plot:
le_plotly
That's all! Leave a comment below if you have any questions.

Original post: Map the Life Expectancy in United States with data from Wikipedia

Read more…

Why Visualize Data?

Guest blog post by Surendran B

If a picture is worth a thousand words, then what about a neat data visualization? Displaying information graphically to generate better insights is not a new phenomenon, but with the advent of technology and increased access to data, it has become far more prominent. Once restricted to the analysis of economics, finance, and science, data visualization has emerged as an industry of its own.

There are now multiple tools to visualize data (like SocialCops Viz), competitions for visualizations, and even data visualization artists. From a haunting depiction of gun violence in America and an assessment of India’s RTE Act to a series of charts highlighting a footballer’s greatness and a collection of maps depicting India’s size, data visualization transcends fields — all data can be visualized.

Data in the eye of the beholder

Why visualize data? Beyond the mere aesthetic attraction of a beautiful graphic, data visualization matters because it can be extremely helpful. It is easier for the human brain to process large volumes of data through visuals rather than text. Studies have shown that humans find it easier to distinguish line length, shape orientation, and color (collectively known as pre-attentive attributes) than to read a series of numbers. This is because around two-thirds of our brain’s neurons are dedicated solely to vision. This makes it easier and quicker to interpret information visually.

Take the example of India’s GDP since Independence. The narrative is a familiar one. After Independence, Indian economic growth was anaemic (rarely hovering above 5%). It was only in the late 1980s and 1990s that growth really accelerated, driven by a tremendous increase in services output. Below are a table and graph conveying this same information, but the graph tells the story in a way that’s both more intuitive and informative.

Data visualization allows us to identify these sorts of trends, along with problems and possible solutions. This makes it a valuable tool for anyone — especially those working in policy. Policymakers use swathes of data across sectors to make important decisions. In this ocean of information, data visualization can quickly show them what needs to be refined or aborted. In a country as diverse and large as India, data visualization’s ability to show data effectively and quickly is paramount.

Data visualization for policy

The Government of India has embraced the potential of data visualization. As part of its Open Data platform, public users are encouraged to create their own visualizations to show different perspectives on government performance. Some ministries are going a step further by implementing their own data visualization initiatives. For instance, the Ministry of Rural Development has created a dashboard on MGNREGA implementation (the Indian government’s flagship workfare program). The dashboard has an intuitive interface that provides administrators with real-time visualized summaries of the program’s performance at all administrative levels (from gram panchayats to the center).

At SocialCops, we work with all sorts of decision makers to create intuitive data visualizations. For example, district collectors and other government officials use our platform to create dashboards, which provide important insights to help officials identify pain points, assess their progress, and target important schemes or initiatives to the places that need them most. Individual users can also create maps to better understand their data using our Viz tool.

Below is an example of a data visualization from the Socio Economic Caste Census — a nation-wide survey of socio-economic conditions of households across India.

The map above shows the proportion of rural households with kuccha roofs (thatched, plastic, or hand-made tiled roofs) in every district in India. The same data could have been described in words — for example, districts in Chhattisgarh, Odisha, and Madhya Pradesh have the highest proportion of kuccha roofs while districts in the South have the lowest, and so on. However, all that information — and more — is revealed through a quick glance at the map.

Many people now say that we are living in an era of big data. By some estimates, we are generating 2.5 quintillion (that’s 18 zeros) bytes of data daily. This data can be used for amazing policies and initiatives. However, for this to happen, data has to be managed and interpreted correctly.

This article was originally published here

Read more…

4 Potential Problems With Data Visualization

Originally posted on Data Science Central

Big data has been a big topic for a few years now, and it’s only going to grow bigger as we get our hands on more sophisticated forms of technology and new applications in which to use them. The problem now is beginning to shift; originally, tech developers and researchers were all about gathering greater quantities of data. Now, with all this data in tow, consumers and developers are both eager for new ways to condense, interpret, and take action on this data.

One of the newest and most talked-about methods for this is data visualization, a system of reducing or illustrating data in simplified, visual ways. The buzz around data visualization is strong and growing, but is the trend all it’s cracked up to be?

The Need for Data Visualization

There’s no question that data visualization can be a good thing, and it’s already helped thousands of marketers and analysts do their jobs more efficiently. Human abilities for pattern recognition tend to revolve around sensory inputs—for obvious reasons. We’re hard-wired to recognize visual patterns at a glance, but not to crunch complex numbers and associate those numbers with abstract concepts. Accordingly, representing complex numbers as integrated visual patterns would allow us to tap into our natural analytic abilities.

The Problems With Visualization

Unfortunately, there are a few current and forthcoming problems with the concept of data visualization:

  1. The oversimplification of data. One of the biggest draws of visualization is its ability to take big swaths of data and simplify them to more basic, understandable terms. However, it’s easy to go too far with this; trying to take millions of data points and confine their conclusions to a handful of pictorial representations could lead to unfounded conclusions, or completely neglect certain significant modifiers that could completely change the assumptions you walk away with. As an example not relegated to the world of data, consider basic real-world tests, such as alcohol intoxication tests, which try to reduce complex systems to simple “yes” or “no” results; as Monder Law Group points out, these tests can be unreliable and flat-out inaccurate.

  2. The human limitations of algorithms. This is the biggest potential problem, and also the most complicated. Any algorithm used to reduce data to visual illustrations is based on human inputs, and human inputs can be fundamentally flawed. For example, a human developing an algorithm may highlight different pieces of data that are “most” important to consider, and throw out other pieces entirely; this doesn’t account for all companies or all situations, especially if there are data outliers or unique situations that demand an alternative approach. The problem is compounded by the fact that most data visualization systems are rolled out on a national scale; they evolve to become one-size-fits-all algorithms, and fail to address the specific needs of individuals.

  3. Overreliance on visuals. This is more of a problem with consumers than it is with developers, but it undermines the potential impact of visualization in general. When users start relying on visuals they can read at a glance to interpret data, they could easily start over-relying on this mode of input. For example, they may take their conclusions as absolute truth, never digging deeper into the data sets responsible for producing those visuals. The general conclusions you draw from this may be generally applicable, but they won’t tell you everything about your audiences or campaigns.

  4. The inevitability of visualization. Already, there are dozens of tools available to help us understand complex data sets with visual diagrams, charts, and illustrations, and data visualization is too popular to ever go away. We’re on a fast course to visualization taking over in multiple areas, and there’s no real going back at this point. To some, this may not seem like a problem, but consider some of the effects—companies racing to develop visualization products, and consumers only seeking products that offer visualization. These effects may feed into user overreliance on visuals, and compound the limitations of human errors in algorithm development (since companies will want to go to market as soon as possible).

There’s no stopping the development of data visualization, and we’re not arguing that it should be stopped. If it’s developed in the right ways, it can be an extraordinary tool for development in countless different areas—but collectively, we need to be aware of the potential problems and biggest obstacles data visualization will need to overcome. 

Read more…
