Originally posted on Data Science Central

Do you want to learn the history of data visualization? Or do you want to learn how to create more engaging visualizations and see some examples? It’s easy to feel overwhelmed with the amount of information available today, which is why sometimes the answer can be as simple as picking up a good book.

These seven amazing data visualization books are a great place for you to get started:

1) Show Me the Numbers: Designing Tables and Graphs to Enlighten, Second Edition

Stephen Few

2) The Accidental Analyst: Show Your Data Who’s Boss

Eileen and Stephen McDaniel

3) Information Graphics

Sandra Rendgen, Julius Wiedemann

4) Visualize This: The FlowingData Guide to Design, Visualization, and Statistics

Nathan Yau

5) Storytelling with Data

Cole Nussbaumer Knaflic

6) Cool Infographics

Randy Krum

7) Designing Data Visualizations: Representing Informational Relationships

Noah Iliinsky, Julie Steele

To check out the 7 data visualization books, click here. For other articles about data visualization, click here.

Top DSC Resources

Follow us on Twitter: @DataScienceCtrl | @AnalyticBridge


10 Dataviz Tools To Enhance Data Science

Originally posted on Data Science Central

This article on data visualization tools was written by Jessica Davis. She's passionate about the practical use of business intelligence, predictive analytics, and big data for smarter business and a better world.

Data visualizations can help business users understand analytics insights and actually see the reasons why certain recommendations make the most sense. Traditional business intelligence and analytics vendors, as well as newer market entrants, are offering data visualization technologies and platforms.

Here's a collection of 10 data visualization tools worthy of your consideration:

Tableau Software

Tableau Software is perhaps the best-known platform for data visualization across a wide array of users. Some Coursera courses dedicated to data visualization use Tableau as the underlying platform. The Seattle-based company describes its mission this way: "We help people see and understand their data."

This company, founded in 2003, offers a family of interactive data visualization products focused on business intelligence. The software is offered in desktop, server, and cloud versions. There's also a free public version used by bloggers, journalists, quantified-self hobbyists, sports fans, political junkies, and others.

Tableau was one of three companies featured in the Leaders square of the 2016 Gartner Magic Quadrant for Business Intelligence and Analytics Platforms.


Qlik

Qlik was founded in Lund, Sweden in 1993. It's another of the Leaders in Gartner's 2016 Magic Quadrant for Business Intelligence and Analytics Platforms. Now based in Radnor, Penn., Qlik offers a family of products that provide data visualization to users. Its new flagship Qlik Sense offers self-service visualization and discovery. The product is designed for drag-and-drop creation of interactive data visualizations. It's available in versions for desktop, server, and cloud.

Oracle Visual Analyzer

Gartner dropped Oracle from its 2016 Magic Quadrant for Business Intelligence and Analytics Platforms report. One of the company's newer products, Oracle Visual Analyzer, could help the database giant make it back into the report in years to come.

Oracle Visual Analyzer, introduced in 2015, is a web-based tool provided within the Oracle Business Intelligence Cloud Service. It's available to existing customers of Oracle's Business Intelligence Cloud. The company's promotional materials promise advanced analysis and interactive visualizations. Configurable dashboards are also available.

SAS Visual Analytics

SAS is one of the traditional vendors in the advanced analytics space, with a long history of offering analytical insights to businesses. SAS Visual Analytics is among its many offerings.

The company offers a series of sample reports showing how visual analytics can be applied to questions and problems in a range of industries. Examples include healthcare claims, casino performance, digital advertising, environmental reporting, and the economics of Ebola outbreaks.

Microsoft Power BI

Microsoft Power BI is the software giant's entry in the data visualization space. Microsoft is the third and final company in the Leaders square of the Gartner 2016 Magic Quadrant for Business Intelligence and Analytics Platforms.

Power BI is not a monolithic piece of software. Rather, it's a suite of business analytics tools Microsoft designed to enable business users to analyze data and share insights. Components include Power BI dashboards, which offer customizable views for business users for all their important metrics in real-time. These dashboards can be accessed from any device.

Power BI Desktop is a data-mashup and report-authoring tool that can combine data from several sources and then enable visualization of that data. Power BI gateways let organizations connect SQL Server databases and other data sources to dashboards.

TIBCO Spotfire

TIBCO acquired data discovery specialist Spotfire in 2007. The company offers the technology as part of its lineup of data visualization and analytics tools. TIBCO updated Spotfire in March 2016 to improve core visualizations. The updates expand built-in data access and data preparation functions, and improve data collaboration and mashup capabilities. The company also redesigned its Spotfire server topology with simplified web-based admin tools.

ClearStory Data

Founded in 2011, ClearStory Data is one of the newer players in the space. Its technology lets users discover and analyze data from corporate, web, and premium data sources. These sources include relational databases, Hadoop, web and social application interfaces, and interfaces from third-party data providers. The company offers a set of solutions for vertical industries. Its customers include Del Monte, Merck, and Coca-Cola.


Sisense

The web-enabled platform from Sisense offers interactive dashboards that let users join and analyze big and multiple datasets and share insights. Gartner named the company a Niche Player in its Magic Quadrant report for Business Intelligence and Analytics Platforms. The research firm said the company was one of the top two in terms of accessing large volumes of data from Hadoop and NoSQL data sources. Customers include eBay, Lockheed Martin, Motorola, Experian, and Fujitsu.

Dundas BI

Mentioned as a vendor to watch by Gartner, but not included in the company's Magic Quadrant for Business Intelligence and Analytics Platforms, Dundas BI enables organizations to create business intelligence dashboards for the visualization of key business metrics. The platform also enables data discovery and exploration with drag-and-drop menus. According to the company's website, a variety of data sources can be connected, including relational, OLAP, flat files, big data, and web services. Customers include AAA, Bank of America, and Kaiser Permanente.


InetSoft

InetSoft is another vendor that didn't qualify for the Gartner report, but was mentioned by the research firm as a company to watch.

InetSoft offers a colorful gallery of BI Visualizations. A free version of its software provides licenses for two users. It lets organizations take the software for a test drive. Serious users will want to upgrade to the paid version. Customers include Flight Data Services, eScholar, ArcSight, and

You can find the original article here. For other articles about data visualization, click here.


Whether you're working on a school presentation or preparing a monthly sales report for your boss, presenting your data in a detailed and easy-to-follow form is essential. It's hard to keep the focus of your audience if you can't help them fully understand the data you're trying to explain. The best way to understand complex data is to show your results in graphic form. This is the main reason why data visualization has become a key part of presentations and data analysis. Let's look at the top five benefits of using data visualization in your work.

Easier data discovery

Visualizing your data helps you and your audience find specific information. Picking out a single piece of information from one-dimensional graphics can be difficult if you have a lot of data to work with. Data visualization can make this effort a whole lot easier.

Simple way to trace data correlations

Sometimes it's hard to notice the correlation between two sets of data. If you present your data in graphic form, you can see how one set of data moves with another. This is a major benefit, as it greatly reduces the effort you need to invest.

Live interaction with data

Data visualization offers you the benefit of live interaction with any piece of data you need. This enables you to spot a change in the data as it happens. And you don't just get simple information about the change; you also get a predictive analysis.

Promote a new business language

One of the major benefits of data visualization over simple graphic solutions is the ability to "tell a story" through data. For example, a simple chart gives you a piece of information and nothing more. Data visualization enables you not only to see the information but also to understand the reasons behind it.

Identify trends

The ability to identify trends is one of the most interesting benefits that data visualization tools have to offer. You can watch the progress of certain data and see the reasons for those changes. With predictive analysis, you can also predict the behavior of those trends in the future.


Data visualization tools have become a necessity in modern data analysis. This need has given rise to many businesses that offer data visualization services.

All in all, data visualization tools have taken analytics to a whole new level and allowed better insight into business data. Let us know about your experience with data visualization tools and how you use them. We'd love to read how they improved your work.


BI Tools for SMEs? Not Just Maybe, But DEFINITELY

I work as a BI consultant and aim to provide the best BI solutions to my clients. I focus on BI for Tally and on upgrading Tally customers to a self-service BI environment with interactive reports and dashboards for Tally. Apart from this, I like traveling, participating in Business Intelligence forums, reading and social networking.

Originally posted on Data Science Central

This article on going deeper into regression analysis with assumptions, plots & solutions was posted by Manish Saraswat. Manish, who works in marketing and Data Science at Analytics Vidhya, believes that education can change this world. R, Data Science and Machine Learning keep him busy.

Regression analysis marks the first step in predictive modeling. No doubt, it’s fairly easy to implement. Neither its syntax nor its parameters create any kind of confusion. But merely running one line of code doesn’t serve the purpose, and neither does looking only at R² or MSE values. Regression tells much more than that!

In R, a fitted regression model returns four diagnostic plots via the plot(model_name) function. Each plot provides significant information, or rather an interesting story, about the data. Sadly, many beginners either fail to decipher the information or don’t care about what these plots say. Once you understand these plots, you will be able to improve your regression model significantly.

For model improvement, you also need to understand regression assumptions and ways to fix them when they get violated.

In this article, I’ve explained the important regression assumptions and plots (with fixes and solutions) to help you understand the regression concept in further detail. As said above, with this knowledge you can bring drastic improvements in your models.

What you can find in this article:

Assumptions in Regression

What if these assumptions get violated?

  1. Linear and Additive
  2. Autocorrelation
  3. Multicollinearity
  4. Heteroskedasticity
  5. Normal Distribution of error terms

Interpretation of Regression Plots

  1. Residuals vs Fitted Values
  2. Normal Q-Q Plot
  3. Scale-Location Plot
  4. Residuals vs Leverage Plot
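
The article itself works in R, and the four plots need a plotting library. As a rough illustration of the numbers behind the first plot and the autocorrelation assumption, here is a minimal Python sketch that uses only the standard library. The data and the function names (fit_line, durbin_watson) are made up for illustration, and the Durbin-Watson statistic is a standard check that this excerpt does not name, so treat it as one possible approach rather than the author's method.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def durbin_watson(res):
    """Durbin-Watson statistic; values near 2 suggest little
    first-order autocorrelation in the residuals."""
    num = sum((res[i] - res[i - 1]) ** 2 for i in range(1, len(res)))
    den = sum(r ** 2 for r in res)
    return num / den

# Made-up data, roughly y = 2x, in observation order.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

intercept, slope = fit_line(x, y)
fitted = [intercept + slope * xi for xi in x]
res = [yi - fi for yi, fi in zip(y, fitted)]

# Pairs of (fitted value, residual) are what a Residuals vs Fitted plot shows.
for f, r in zip(fitted, res):
    print(f"{f:7.3f} {r:7.3f}")
print("Durbin-Watson:", round(durbin_watson(res), 3))
```

If the residuals show a clear curve against the fitted values, the linearity assumption is in doubt. If the spread of the residuals grows or shrinks along the fitted values, heteroskedasticity is the likelier problem.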

You can find the full article here. For other articles about regression analysis, click here. 

Note from the Editor: For a robust regression that will work even if all these model assumptions are violated, click here. It is simple (it can be implemented in Excel and it is model-free), efficient and very comparable to the standard regression (when the model assumptions are not violated).  And if you need confidence intervals for the predicted values, you can use the simple model-free confidence intervals (CI) described here. These CIs are equivalent to those being taught in statistical courses, but you don't need to know stats to understand how they work, and to use them. Finally, to measure goodness-of-fit, instead of R-Squared or MSE, you can use this metric, which is more robust against outliers. 


‘Key Performance Indicators’, or KPIs, are very important to the enterprise, and nearly every company is talking about them these days. But there are still a lot of businesses that don’t know how to define the right KPIs to get a good picture of success.

To really understand where you are succeeding and where you are falling short, you have to measure the right things. For example, if your goal is to increase sales in the Minneapolis store by 5% in the year 2015, you couldn’t determine success by establishing a KPI to measure the number of shopping bags you have on hand in the store. Do we care about the number of sales people on staff at a certain time of day, and whether that affects our sales? Do we want to look at the store hours for a particular day of the week to determine whether extended hours in a certain season or on a certain day may result in more sales? Should we look at the impact of sales rep training on closed sales?

In like manner, if you want to establish metrics to evaluate the effectiveness of your internet marketing program, you’ll probably have to look at your program from various perspectives. That is true of nearly every initiative in your company, and that is where many businesses go awry. They assume that they can establish one metric for each goal when, in fact, your business is more complex than that and your goals usually have more than one factor or aspect that will determine success.

Let’s consider the KPIs for an internet marketing program. We can’t just say that we want to increase sales. We have to decide how we will determine success. Will we include site visits, visits per page, the click-to-conversion ratio, the number of email and newsletter ‘unsubscribe’ requests, the click-through rates for visitors coming to the website from a social media site, and so on? These factors might tell us which internet marketing techniques are driving traffic to our site, but do they tell us whether this traffic is coming from our target audience, or what percentage of the traffic from each source is actually resulting in a purchase? Do they measure the time of day, the day of the week, or the season in which these sales conversions are most likely?
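
To make the question about "what percentage of the traffic from each source is actually resulting in a purchase" concrete, here is a small Python sketch. The traffic sources, the counts and the conversion_rate helper are all hypothetical. The sketch only shows the arithmetic of comparing sources by conversion rate rather than by raw visits.

```python
# Hypothetical visit and purchase counts per traffic source.
traffic = {
    "email": {"visits": 1200, "purchases": 48},
    "social": {"visits": 3000, "purchases": 45},
    "search": {"visits": 2500, "purchases": 125},
}

def conversion_rate(visits, purchases):
    """Share of visits that ended in a purchase (0.0 if there were no visits)."""
    return purchases / visits if visits else 0.0

rates = {source: conversion_rate(**counts) for source, counts in traffic.items()}
best = max(rates, key=rates.get)

# List sources from highest to lowest conversion rate.
for source, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{source}: {rate:.1%}")
print("Best-converting source:", best)
```

In this made-up data the social channel brings the most visits but converts worst, which is the kind of gap a visits-only KPI would hide.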

Of course, every business, industry, location and team is different and you have to look carefully at your own business to determine what is relevant. The most important thing to ask yourself when you establish KPIs is, ‘how does this measurement correlate to our success?’ If I measure this particular thing, does the resulting number or data point give me any insight into how well we are doing, how much money we are making, and whether this task, activity or goal is actually having an effect on the overall performance of the business?

There is one final point to consider when establishing Key Performance Indicators (KPIs) and an integrated business intelligence approach to decision-making. Enterprise culture and communication are important. There are industry-standard and business function-specific business intelligence tools with KPI modules, but these solutions still have to be tailored to the individual organization and its targets, with minimums and maximums defined and then gradually rolled out to the teams for adoption. In order to get a true picture of KPIs and business intelligence, the enterprise must integrate data from disparate data sources and systems, and that takes careful planning and implementation.

Throughout this process, the business must be committed to building a performance-driven culture and to streamlining and improving communication. In all likelihood, getting to the desired state will be an iterative process. It may seem like the enterprise is taking the long way around. But the business team must focus on building for the long term and on achieving solid results and a culture that supports clear, concise, objective decision-making and full commitment to business success at every level.

If a business is committed to performance-driven management, it must link its goals to its processes and create key performance indicators that objectively measure performance and keep the company on track. Whether your goal is to create a successful eCommerce site, increase customer satisfaction by 15% or reduce expenses, you must have a good understanding of what you mean by the word ‘success’.


Successful sales force management is dependent on up-to-date, accurate information. With appropriate, easy access to business intelligence, a Sales Director and Sales Managers can monitor goals and objectives. But that’s not all a business intelligence tool can do for a sales team. In today’s competitive market, marketing, advertising and sales teams cannot afford to wait to be outstripped by the competition. They must begin to court and engage a customer before the customer has the need for an item. By building brand awareness and improving product and service visibility, the sales team can work seamlessly throughout the marketing and sales channel to educate and enlighten prospects and then carry them through the process to close the deal. To do that, the sales staff must have a comprehensive understanding of buying behaviors, current issues with existing products, pricing points and the impact of changing prices, products or distribution channels. With access to data integrated from CRM, ERP, warehousing, supply chain management, and other functions and data sources, a sales manager and sales team can create personalized business intelligence dashboards to guide them through the process and to help them analyze and understand trends and patterns before the competition strikes.

The enterprise must monitor sales results at the international, national, regional, local, team and individual sales professional levels. As a sales manager, you should be able to manage incentives and set targets with complete confidence, and provide accurate sales forecasts and predictions to ensure that the enterprise consistently meets its goals and can depend on the predicted revenue and profits for investment, new product development, market expansion and resource acquisition.

Business Intelligence for the sales function must include Key Performance Indicators (KPI) to help the team manage each role and be accountable for objectives and goals. If a sales region fails to meet the established plan, the business can quickly ascertain the root cause of the issue, whether it is product dissatisfaction, poor sales performance, or any one of a number of other sources.

Since the demand generated by the sales force directly affects the production cycle and plan, the sales team must monitor sales targets and objectives against product capacity and production to ensure that they can satisfy the customer without shortfalls or back orders. If some customers are behind on product payments, a business must be able to identify the source of the issue and address that issue before it results in decreased revenue and results.

The ten benefits listed below comprise a set of ‘must haves’ for every sales team considering a business intelligence solution:

  1. Set targets and allocate resources based on authentic data, rather than speculation
  2. Establish, monitor and adapt accurate forecasts and budgets based on up-to-date, verified data and objective KPIs
  3. Analyze current data, and possible cross-sell and up-sell revenue paths and the estimated lifetime value of a customer
  4. Analyze the elements of sales efforts (prospecting, up-selling, discounts, channel partners, sales collaterals, presentations) and adapt processes that do not provide a competitive edge and strong customer relationships and client loyalty
  5. Measure the factors affecting sales effectiveness to improve sales productivity and correct strategies that do not work
  6. Achieve a consistent view of sales force performance, with a clear picture of unexpected variations in sales and immediate corrective action and strategic adjustment based on trends and patterns
  7. Understand product profitability and customer behavior, by spotlighting customers and products with the highest contribution to the bottom line
  8. Revise expense and resource allocation using the net value of each customer segment or product group
  9. Identify the most effective sales tactics and mechanisms, and the best resources and tools, to meet organizational sales objectives
  10. Establish a personalized, automated alert system to identify and monitor upcoming opportunities and threats

When the enterprise provides a single source, integrated view of enterprise data from numerous sources and enables every user to build views, dashboards and KPIs, every member of the sales team is engaged in the pursuit of strategic, operational and tactical goals. In this way, the enterprise can acquire new clients, retain existing clients, and sell new products and services without a misstep.


Guest blog post by SupStat

Contributed by Sharan Duggal.  You can find the original article here.


We know that war and civil unrest account for a significant proportion of deaths every year, but how much can mortality rates be attributed to a simple lack of basic resources and amenities, and what relationship do mortality rates have with such factors? That’s what I set out to uncover using WorldBank data that covers the globe for up to the last 50-odd years, and I found a strong relationship with some of the available data.

If you were to look at overall mortality rates, the numbers would be muddied by several factors, including the aforementioned causes of death, so I decided to look at two related, but more specific outcome variables – infant mortality as well as risk of maternal death.

Infant mortality is defined as the number of infants dying before reaching one year of age, per 1,000 live births in a given year.
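
As a quick worked example of this definition, the Python sketch below computes the rate from made-up counts. The function name and the numbers are illustrative only and are not taken from the WorldBank data.

```python
def infant_mortality_rate(infant_deaths, live_births):
    """Deaths before one year of age per 1,000 live births in the same year."""
    if live_births <= 0:
        raise ValueError("live_births must be positive")
    return 1000 * infant_deaths / live_births

# Made-up counts: 350 infant deaths among 10,000 live births.
rate = infant_mortality_rate(infant_deaths=350, live_births=10_000)
print(rate)       # 35.0 per 1,000 live births
print(rate / 10)  # 3.5, the same figure expressed as a percentage
```

Dividing the per-1,000 figure by 10 gives the percentage form used elsewhere in this post, so a rate of 100 per 1,000 corresponds to 10% of infants.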

Lifetime risk of maternal death is the probability that a 15-year-old female will die eventually from a maternal cause assuming that current levels of fertility and mortality (including maternal mortality) do not change in the future, taking into account competing causes of death.

While I am sure these numbers can also be impacted by things like civil unrest, this approach does focus on individuals who are arguably more likely to be affected by things like communicable diseases and a lack of basic provisions like clean water, electricity or adequate medical resources, among others.

So, what do overall mortality rates even look like?

The density plot below includes the overall infant mortality distribution along with some metrics indicating the availability of key resources. Infant mortality rates peak at around 1% and the availability of resources peaks closer to 100%. In both cases we see really long tails, indicating that there is a portion of the population experiencing less than ideal numbers.

So to drill down further, let’s have a closer look at the distribution of both outcome variables by year. The boxplots below suggest that both infant mortality rates and risk of maternal death have shown not only steady overall improvements over the years but also a reduction in the disparity across country-specific observations. But the upper end of these distributions still represents shocking numbers for some countries: over 10% of infants dying every year (down from a high of 24% in 1961) and a 7.5% probability that a 15-year-old girl living today will eventually die of a maternal cause (down from over 15% twenty-five years ago).

Please note: points have been marginally jittered above for clearer visual representation

Mortality Rates across the Globe

The below map plots the 2012 distribution of infant mortality rates by country. I chose 2012 because most of the covariates I would eventually like to use contain the best information from this year, with a couple of exceptions. It also presents a relatively recent picture of the variables of interest.

As can be seen, the world is distinctly divided, with many African, and some South Asian, countries bearing a bigger burden of infant mortality. And if it wasn’t noticeable on the previous boxplot, the range of values, as shown in the scale below, is particularly telling of the overall disparity of mortality rates, pointing to a severe imbalance across the world.

The map representing the risk of maternal death is almost identical, and as such has been represented in a different color for differentiation. Here, the values range from close to 0% to over 7%.

Bottom Ranked Countries Over the Years

After factoring in all 50+ years of data for infant mortality and 26 years of data for risk of maternal death, and then ranking countries, the same set of countries feature at the bottom of the list.

The below chart looks at the number of times a country has had one of the worst three infant mortality rates in any given year since 1960.

The chart for maternal data goes from 1990 through to 2015. It’s important to note that Chad and Sierra Leone were ranked in the bottom 3 for maternal risk of death in every year since 1990.

Please note that numbers may be slightly impacted by missing data for some countries, especially for earlier years in the data set.
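
The counting behind these charts can be sketched in a few lines of Python. The country labels and rates below are invented, and worst_n_counts is a hypothetical helper, not the code used for the original analysis. Missing values are marked with None and skipped, which is one simple way to handle the gaps mentioned above.

```python
from collections import Counter

# Invented rates per 1,000 live births; None marks missing data.
rates_by_year = {
    1960: {"A": 210.0, "B": 180.0, "C": 150.0, "D": 90.0, "E": None},
    1961: {"A": 205.0, "B": 140.0, "C": 160.0, "D": 170.0, "E": 95.0},
}

def worst_n_counts(rates_by_year, n=3):
    """Count how often each country is among the n highest rates in a year."""
    counts = Counter()
    for rates in rates_by_year.values():
        # Skip countries with no data for this year.
        known = {country: r for country, r in rates.items() if r is not None}
        worst = sorted(known, key=known.get, reverse=True)[:n]
        counts.update(worst)
    return counts

counts = worst_n_counts(rates_by_year)
for country, times in counts.most_common():
    print(country, times)
```

Skipping a missing value means a country cannot be counted in that year at all, so sparse early data can understate how often it would have ranked in the bottom three.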

Relationship between Mortality & Resources

Getting back to the original question, are there any low-hanging fruit and easy fixes for such a dichotomous situation? While my efforts during this analysis did not include any regressions, I did want to get an initial understanding of whether the availability of basic resources had a strong association with mortality rates and, if such a relationship existed, which provisions were more strongly linked with these outcomes. The findings could serve as a platform for further research.

The below correlation analysis helped home in on some of the stronger linkages and helped weed out some of the weaker ones.

Note: the correlation analysis was run using 2012 data for all metrics, except for “Nurses and Midwives (per 1000 people)” and “Hospital beds (per 1000 people)”, for which 2010 and 2009 data were used, respectively, due to poorer availability of 2012 data for these measures.
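
For readers who want to see what a single cell of such a correlation analysis involves, here is a minimal Python sketch. It assumes a Pearson coefficient, since the post does not say which coefficient was used, and the two short series are invented rather than taken from the WorldBank data.

```python
from math import sqrt
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Invented country-level values for one year.
electricity_access_pct = [15, 30, 55, 80, 100]
infant_mortality_per_1000 = [90, 70, 45, 20, 5]

r = pearson(electricity_access_pct, infant_mortality_per_1000)
print(round(r, 3))  # strongly negative for this made-up data
```

A coefficient near -1 or +1 signals a strong linear association and one near 0 a weak one. As with the analysis here, it says nothing about causation.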


Focusing on the first two columns of the above correlation plot, which represent risk of maternal death and infant mortality, we see a very similar pattern across the variables included in the analysis. Besides basic resources, I had also included items like availability of renewable freshwater resources and land area, to see if naturally available resources had any linkages to the outcomes in question. They didn’t, and so they were removed from the analysis. In the plot above, it can also be seen that average rainfall and population density don’t have much of a relationship with the mortality rates in question. What was also surprising was that access to anti-retroviral therapy, too, had a weak correlation with mortality rates in general.

The metrics that had the strongest relationship (in the 0.75 to 0.85 range) were:

  • Percent of population with electricity
  • Percent of population with access to non-solid fuel
  • Percent of population with access to improved sanitation facilities, and
  • Percent of population with access to improved water sources

The first two require no definitional explanation. Improved sanitation facilities are those that ensure the hygienic separation of human excreta from human contact. Access to improved water sources refers to the percentage of the population using an improved drinking water source, including piped water on premises and other improved drinking water sources (public taps or standpipes, tube wells or boreholes, protected dug wells, protected springs, and rainwater collection).

Analyzing the strongly correlating factors by Region

The following 4 charts look at regional performance of the key identified metrics. The pattern is the same as that seen on the static world map from 2012, but this also gives us a view into how things have been trending over the past 25 years on the resources that seem to be strongly linked with infant and maternal mortality. We see a fairly shallow slope for Sub-Saharan Africa on access to non-solid fuel as well as on improved sanitation facilities. Improvements in drinking water access have been much better.

South Asian countries ranked lowest on the provision of sanitation facilities in the early ’90s, but have made improvements since.


My analysis found a very strong relationship between mortality rates and basic provisions. It also weeded out some factors which were less important. As a next step, it may be helpful to do a deeper country-specific analysis for African and South Asian nations that suffer from a chronic lack of basic infrastructure, to see where investments would be most fruitful in bringing these countries to a closer state of parity with the developed world.


Originally posted on Data Science Central

“I keep saying that the sexy job in the next 10 years will be statisticians, and I’m not kidding”

So said Hal Varian, Chief Economist at Google and emeritus professor at the University of California, Berkeley, on the 5th of August 2009.

Today, what Hal Varian said almost seven years ago has been confirmed, as is highlighted in the following graph taken from Google Trends, which gives a good idea of the current attention to the figure of the Data Scientist.

The Observatory for Big Data Analytics & BI of Politecnico di Milano has been working on the theme of Data Scientists for a few years, and has now prepared a survey to be submitted to Data Scientists that will be used to create a picture of the Data Scientist, within their company and the context in which they operate.

If you work with data in your company, please support us in our research and take this totally anonymous survey here. Thank you from the Observatory for Big Data Analytics & BI.


Graph 1: How many times the term "Data Scientist" has been searched on Google. The numbers in the graph represent the searched term in relation to the highest point in the graph. The value of 100 is given to the point with the maximum number of searches; the other values are proportional.
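
The scaling described in this caption can be sketched as follows. The monthly counts are invented and scale_to_peak is a hypothetical helper. The sketch follows the caption's description and is not a statement of how Google Trends computes its figures internally.

```python
def scale_to_peak(counts):
    """Rescale so the largest value becomes 100 and the rest stay proportional."""
    peak = max(counts)
    if peak <= 0:
        raise ValueError("need at least one positive count")
    return [round(100 * c / peak) for c in counts]

# Invented search counts for five consecutive periods.
monthly_searches = [120, 300, 450, 900, 1800]
print(scale_to_peak(monthly_searches))  # [7, 17, 25, 50, 100]
```

Because every value is relative to the peak, the graph shows how interest changed over time but not how many searches were actually made.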

Mike Loukides, VP of O’Reilly Media, summarized the Data Scientist’s job description in these words:

"Data scientists are involved with gathering data, massaging it into a tractable form, making it tell its story, and presenting that story to others."

We are in the era of Big Data, an era in which 2.5 quintillion (2.5 × 10^18) bytes are generated every day. Both the private and public sectors everywhere are adapting so that they can exploit the potential of Big Data by introducing into their organizations people who are able to extract information from data.

Getting information out of data is of increasing importance because of the huge amount of data available. As Daniel Keys Moran, programmer and science fiction writer, said: “You can have data without information, but you cannot have information without data”.

In companies today, we are seeing positions like the CDO (Chief Data Officer) and Data Scientists more often than we used to.

The CDO is a business leader, typically a member of the organization’s executive management team, who defines and executes analytics strategies. This is the person actually responsible for defining and developing the strategies that will direct the company’s processes of data acquisition, data management, data analysis and data governance. This means that new governance roles and new professional figures have been introduced in many organizations to exploit what Big Data offers them in terms of opportunities.

According to the report on “Big Success with Big Data” (Accenture, 2014), 89% of companies believe that, without a big data analytics strategy, in 2015 they risk losing market share and will no longer be competitive.

Collecting data is not simply retrieving information: the Data Scientists’ role is to translate data into information, and currently there is a dearth of people with this set of skills.

It may seem controversial, but both companies and Data Scientists know very little about what skills are needed. They are operating in a turbulent environment where frequent monitoring is needed to know who actually uses which tools, which tools are considered old and becoming obsolete, and which are those used by the highest and lowest earners. According to a study by RJMetrics (2015), the Top 20 Skills of a Data Scientist are those contained in the following graph. 

The graph clearly shows the importance of tools and programming languages such as R and Python. Machine Learning, Data Mining and Statistics are also high up in the set of most requested skills. Those relating to Big Data are at about the 15th place.

The most recent research on Data Scientists showed that these professionals are more likely to be found in companies belonging to the ICT sector, internet companies and software vendors, such as Microsoft and IBM, rather than in social networks (Facebook, LinkedIn, Twitter), Airbnb, Netflix, etc. The following graph, provided like the previous one by RJMetrics, gives the proportion of Data Scientists by industry.

It is important to keep monitoring Data Scientists throughout industrial sectors, their diffusion and their main features, because, in the unsettled business world of today, we can certainly expect a great many changes to take place while companies become aware, at different times and in different ways, of the importance of Data Scientists.

Read more…

Shopper Marketing - Infographic

Originally posted on Data Science Central

This infographic on Shopper Marketing was created by Steve Hashman and his team. Steve is Director at Exponential Solutions (The CUBE) Marketing. 

Shopper marketing focuses on the customer in and at the point of purchase. It is an integrated and strategic approach to a customer’s in-store experience which is seen as a driver of both sales and brand equity.

For more information, click here


Read more…

Guest blog post by Vimal Natarajan


The City and County of San Francisco launched an official open data portal called SF OpenData in 2009 as a product of its official open data program, DataSF. The portal contains hundreds of city datasets for use by developers, analysts, residents and more. Under the category of Public Safety, the portal contains the list of SFPD Incidents since Jan 1, 2003.
In this post I have done an exploratory time-series analysis on the crime incidents dataset to see if there are any patterns.


The data for this analysis was downloaded from the publicly available dataset on the City and County of San Francisco’s OpenData website, SF OpenData. The crime incidents dataset has data recorded from the year 2003 to date. I downloaded the full dataset and performed my analysis for the time period from 2003 to 2015, filtering out the data from the year 2016. There are nearly 1.9 million crime incidents in this dataset.
I have performed minimal data processing on the downloaded raw data to facilitate my analysis.


Crimes by Year

The following plot depicts the crimes recorded from the year 2003 till the end of the year 2015.

The horizontal line represents the average number of crimes during those years, which is just below 150,000 crimes per year. As you can observe, from 2003 to 2007 the number of crime incidents decreased steadily, but in 2008 and 2009 there was a slight increase. These are the two years in which the United States went through the financial and subprime mortgage crisis, resulting in what is called the Great Recession. According to the US National Bureau of Economic Research, the recession began around January 2008 and ended around June 2009. As most statisticians say, “Correlation does not imply causation”. I too want to emphasize that without additional data and related analysis it may not be possible to connect these two events, but it is nevertheless an interesting observation. Following that period, there was a slight decrease in crime incidents during the next two years, but the number has increased since 2012, ending up above average from 2013 to 2015.

Mean crimes by month

The following plot depicts the mean crimes for each month from January to December. You can observe that the mean crime for each month is more or less around the monthly average, which is just below 12,000 (horizontal line). One interesting observation is that the mean crime is significantly below the monthly average for the months of February, November and December. Possible reasons are that February has fewer days than the other months, and that November and December fall in the festive and holiday season.

Mean crimes by day of the month

The following plot depicts the mean crimes for the different days of the month. You can observe that the mean crime for each day of the month is close to the daily average, which is just below 400 (horizontal line), for the days from the 2nd of the month to the 28th. The mean crime on the first day of the month is significantly above average. One possible reason could be that the first day of the month is usually payday. Again, correlation does not imply causation; without additional related data and insights derived from the analysis of that data we cannot be sure. The 29th and 30th are also below average, and the reason could be that February has no 30th and has a 29th only in leap years. The mean crime for the 31st of the month is around half of the daily average, which might be because only seven of the twelve months have a 31st day.
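The calendar effect mentioned here can be checked with a quick count. The sketch below is not part of the original analysis: it counts how often each day of the month occurs between 2003 and 2015, and it assumes (the post does not say) that the per-day means were obtained by dividing by the number of months in the period.

```r
# Every calendar day in the study period.
days <- seq(as.Date("2003-01-01"), as.Date("2015-12-31"), by = "day")

# How many times does each day of the month (1..31) occur?
day_of_month <- as.integer(format(days, "%d"))
occurrences <- table(day_of_month)

n_months <- 13 * 12  # 156 months from 2003 through 2015

# Share of months in which each day of the month exists at all.
share <- as.numeric(occurrences) / n_months
names(share) <- names(occurrences)
round(share[c("1", "28", "29", "30", "31")], 2)
```

Under that assumption, calendar effects alone would put the 31st at roughly 58% of the daily average (91 of 156 months) and the 29th and 30th a little above 90%, which is broadly in line with the pattern described.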

Mean crimes by hour of the day

The following plot depicts the mean crimes by the hour of the day. You can observe that this plot is very different from the other plots, in that the hourly crime counts vary widely around the hourly average, which is around 16 (horizontal line). Within this plot you can observe some interesting patterns: crime incidents are well above average around midnight, then decline steadily to significantly below the hourly average by the early morning, around 5 AM. From 6 AM the crime incidents steadily increase and spike around noon. From noon they stay well above average, peaking around 6 PM and declining after that.

Mean crimes by day of the week

The following plot depicts the mean crimes by the day of the week. As you can observe, Sunday has the fewest crime incidents, well below the daily average, which is just below 400 (vertical line), and Friday has the most crime incidents, well above the daily average.

Mean crimes during holidays

The following plot depicts the mean crimes during a few key days, such as holidays in the United States. You can observe here that the number of crime incidents is significantly higher on New Year’s Day, well above the daily average, which is just below 400 (horizontal line). During the other holidays the number of crime incidents is more or less the same as the daily average, but on Christmas Eve and Christmas Day the number of crime incidents is significantly lower than the daily average. Since Thanksgiving Day falls on a different date each year, as an approximation I chose the date of November 24 here. I was expecting to see significantly fewer crime incidents during this time period, but that does not seem to be the case.
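The fixed-date approximation for Thanksgiving could be replaced by the actual date in each year. The sketch below is a suggestion rather than the author's method; the helper name `thanksgiving` is hypothetical and it uses only base R.

```r
# Thanksgiving in the United States falls on the fourth Thursday of November.
thanksgiving <- function(year) {
  november <- seq(as.Date(paste0(year, "-11-01")),
                  as.Date(paste0(year, "-11-30")), by = "day")
  # POSIXlt weekdays run from 0 (Sunday) to 6 (Saturday), so Thursday is 4.
  thursdays <- november[as.POSIXlt(november)$wday == 4]
  thursdays[4]
}

# One date per year of the study period.
dates <- do.call(c, lapply(2003:2015, thanksgiving))
format(dates, "%Y-%m-%d")
```

The resulting dates always fall between November 22 and November 28, so a fixed November 24 matches the real holiday only in some years.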


In conclusion, the observations above reveal some patterns in the crime incidents:

  • The average number of crime incidents happening daily in the City and County of San Francisco is around 400.
  • The number of crime incidents is well above average around midnight and in the early evening, and lowest in the early morning hours.
  • The number of crime incidents is usually lower during Christmas.
  • The number of crime incidents has been slowly increasing in recent years.
  • The number of crime incidents is high on New Year’s Day and at the beginning of every month.

The above is just a high-level exploratory time-series analysis. With further in-depth analysis it is possible to arrive at more insights. In my future posts I will try to perform those analyses.


This analysis was performed entirely using RStudio version 0.99 and R Version 3.2.0.
The data processing and plots were done using the R libraries ggplot2 and dplyr.
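The post does not include its code, so the following is only a minimal sketch of how a "mean crimes by month" figure could be computed. It uses base R instead of dplyr to stay dependency-free, and the incident dates are made up for illustration; they are not taken from the SFPD dataset.

```r
# Toy incident dates (made up for illustration, not the SFPD data).
incidents <- data.frame(
  date = as.Date(c("2014-01-05", "2014-01-20", "2014-02-02",
                   "2015-01-11", "2015-02-03", "2015-02-14", "2015-02-20"))
)
incidents$year  <- as.integer(format(incidents$date, "%Y"))
incidents$month <- as.integer(format(incidents$date, "%m"))
incidents$n     <- 1L

# Step 1: count incidents in each calendar month of each year.
per_year_month <- aggregate(n ~ year + month, data = incidents, FUN = sum)

# Step 2: average those counts across years to get the mean per month.
mean_by_month <- aggregate(n ~ month, data = per_year_month, FUN = mean)
mean_by_month
```

The same two-step pattern (count per period, then average across periods) would apply to the day-of-month, hour and weekday plots. Note that a month with no incidents in some year would be missing from step 1 rather than counted as zero, which does not matter for a dataset of this size but would for sparse data.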

Read more…

Guest blog post by Dante Munnis

Digging through messy data and doing numerous calculations just so you can submit a report or arrive at the result of your quarterly business development can sometimes be nigh impossible. After all, we are only human, and by the time we get to the other side of our spreadsheet equation, we have lost all sense of what we were trying to accomplish.

Luckily, there are data visualization and analysis tools out there that can do most of the heavy lifting for us. Remember, you will still need to do some of the work yourself, but putting it all together will become that much simpler. Let’s take a look at some of the best data analysis tools at our disposal.

1. Open Refine

At first, you might be surprised as to how much Open Refine resembles Google’s own Spread Sheets. This is because it started as a Google project but quickly became crowd sourced and independent. In practice, this means that Open Refine has all the built-in algorithms and formulas that you might need for your business data analysis.

Keep in mind that while it does resemble Spread Sheets, it doesn’t have the regular features you would expect, such as manual cell manipulation and custom algorithms. You would need to export your data and bring it back in. If that doesn’t cause too much headache, you might want to give Open Refine a shot.

2. Data Wrangler

Stanford University’s own data analysis tool is open to public use. While text manipulation and a web-based interface are certainly a plus, you should consider other factors as well. Some of the formulas provided by default don’t work well with large amounts of data, often giving false results or crashing the tool outright. While easy to use and accessible, Data Wrangler might not be a good tool for internal and sensitive data, since all of the data is stored at Stanford for research purposes.

3. Rapid Miner

As one of the best data visualization tools out there, Rapid Miner had to find its way onto our list. It can not only manipulate custom data and run calculations on it, but also analyze, model and visualize the results. This award-winning tool is known to provide great results no matter the data you are trying to analyze.

The near-perfect visualization system is just an added bonus considering everything that you are getting. If you need a tool that can help you lead and develop projects with coworkers that are less than adept at analysis, Rapid Miner is the perfect tool for the job.

4. Wolfram Alpha

Ever wonder what it would feel like to have a personal computing assistant? Wolfram Alpha is exactly such a platform. Think of Google Search, but for business analytics and data research. Whatever your field of work and specific needs, you can be sure that Wolfram will make sense of it and help you decode any problems that you might be experiencing.

5. Solver

Sometimes you don’t need external apps or web services for your data analysis. Solver is one such addition to your Excel spread sheets. Offering a vast variety of optimization and programming algorithms, Solver will help you make sense of your data at a much faster rate than you otherwise would. It’s light, fast and easy to use, so there’s no reason not to give it a shot. Keep in mind that Solver won’t be able to make sense of more complex and demanding analysis tasks, so make sure that you use it in a smart way.

6. Google Fusion Tables

While it may not be the most versatile or complex tool on the web, Fusion Tables is one of the most accessible data visualization services out there. The best thing about it is that it is free to use and very approachable, so there’s no need to spend hours on end learning what’s what. You can visualize your data in any shape or form you desire. Just keep in mind that this tool is suited to simple calculations, not vast, sprawling data analysis tasks.

7. Zoho Reports

You might have heard about Zoho, since it’s one of the most popular business data analysis tools on the web. It’s fairly easy to get into and use, requiring only a simple log-in and data input. Use Zoho to quickly and professionally turn your data into charts, tables and pivots in order to use them for further research.

8. NodeXL

Taking the best from both worlds, NodeXL is simple to use and fairly advanced in its algorithmic possibilities. You can not only analyze and visualize raw data, but also use it to develop and visualize networks and relations between different results. While some of the features might be too advanced for everyday data analysis, NodeXL is the perfect tool for more complex tasks.

9. Google Chart Tools

Google Chart Tools is another Google tool on our list; it provides visualization and analysis but doesn’t focus on raw data. Instead, you can point the tool at different sources on the web and combine them in the visualized charts, analyzing external data in order to get the results that you need. While it’s very useful and provides accurate data, Google Chart Tools isn’t very user friendly, requiring a bit of programming knowledge in order to fully utilize its capabilities.

10. Time Flow

Data analysis sometimes requires a different kind of visualization. Time Flow is a tool that can analyze and visualize time points and create a data map that provides a clear picture of how and when your specific data developed. While it does sound complex, the tool itself is fairly easy to use and allows a plethora of customization options. Use Time Flow whenever you need to create timelines and streamline your data.

About the author: Dante Munnis is a media and marketing expert currently working at Essay Republic. He shares ideas and experience on how to build your brand and attract more customers.

Read more…

Originally posted on Data Science Central

Things, not Strings
Entity-centric views on enterprise information and all kinds of data sources provide means to get a more meaningful picture about all sorts of business objects. This method of information processing is as relevant to customers, citizens, or patients as it is to knowledge workers like lawyers, doctors, or researchers. People actually do not search for documents, but rather for facts and other chunks of information to bundle them up to provide answers to concrete questions.

Strings, or names for things, are not the same as the things they refer to. Still, those two aspects of an entity get mixed up regularly, nurturing a Babylonian confusion of language. Any search term can refer to different things, which is why Google has rolled out its own knowledge graph to help organize information on the web at a large scale.

Semantic graphs can build the backbone of any information architecture, not only on the web. They can enable entity-centric views also on enterprise information and data. Such graphs of things contain information about business objects (such as products, suppliers, employees, locations, research topics, …), their different names, and relations to each other. Information about entities can be found in structured (relational databases), semi-structured (XML), and unstructured (text) data objects. Nevertheless, people are not interested in containers but in entities themselves, so they need to be extracted and organized in a reasonable way.

Machines and algorithms make use of semantic graphs to retrieve not only the objects themselves but also the relations that can be found between the business objects, even if they are not explicitly stated. As a result, ‘knowledge lenses’ are delivered that help users to better understand the underlying meaning of business objects when put into a specific context.

Personalization of information
The ability to view entities or business objects in different ways when put into various contexts is key for many knowledge workers. For example, drugs have regulatory aspects, a therapeutic character, and some other meaning to product managers or sales people. One can benefit quickly when confronted only with those aspects of an entity that are really relevant in a given situation. This rather personalized information processing creates heavy demand for a semantic layer on top of the data layer, especially when information is stored in various forms and scattered across different repositories.

Understanding and modelling the meaning of content assets and of interest profiles of users are based on the very same methodology. In both cases, semantic graphs are used, and also the linking of various types of business objects works the same way.

Recommender engines based on semantic graphs can link similar contents or documents that are related to each other in a highly precise manner. The same algorithms help to link users to content assets or products. This approach is the basis for ‘push-services’ that try to ‘understand’ users’ needs in a highly sophisticated way.

‘Not only MetaData’ Architecture
Together with the data and content layer and its corresponding metadata, this approach unfolds into a four-layered information architecture as depicted here.

Following the NoSQL paradigm, which is about ‘Not only SQL’, one could call this content architecture ‘Not only Metadata’, thus ‘NoMeDa’ architecture. It stresses the importance of the semantic layer on top of all kinds of data. Semantics is no longer buried in data silos but rather linked to the metadata of the underlying data assets. Therefore it helps to ‘harmonize’ different metadata schemes and various vocabularies. It makes the semantics of metadata, and of data in general, explicitly available. While metadata most often is stored per data source, and therefore not linked to each other, the semantic layer is no longer embedded in databases. It reflects the common sense of a certain domain and through its graph-like structure it can serve directly to fulfill several complex tasks in information management:

  • Knowledge discovery, search and analytics
  • Information and data linking
  • Recommendation and personalization of information
  • Data visualization

Graph-based Data Modelling
Graph-based semantic models resemble the way human beings tend to construct their own models of the world. Any person, not only a subject matter expert, organizes information by at least the following six principles:

  1. Draw a distinction between all kinds of things: ‘This thing is not that thing’
  2. Give things names: ‘This thing is my dog Goofy’ (some might call it Dippy Dawg, but it’s still the same thing)
  3. Categorize things: ‘This thing is a dog but not a cat’
  4. Create general facts and relate categories to each other: ‘Dogs don’t like cats’
  5. Create specific facts and relate things to each other: ‘Goofy is a friend of Donald’, ‘Donald is the uncle of Huey, Dewey, and Louie’, etc.
  6. Use various languages for this; e.g. the above mentioned fact in German is ‘Donald ist der Onkel von Tick, Trick und Track’ (remember: the thing called ‘Huey’ is the same thing as the thing called ‘Tick’ - it’s just that the name or label for this thing is different in different languages).
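One way to picture these principles is as a table of subject-predicate-object statements, which is the general shape RDF uses. The sketch below is written in R, the language of the code elsewhere on this page. The identifiers (`ex:thing1` and so on) and predicate names are invented for illustration and do not follow any real vocabulary.

```r
# Each row is one statement about a thing; all identifiers are hypothetical.
triples <- data.frame(
  subject   = c("ex:thing1", "ex:thing1", "ex:thing1", "ex:thing1",
                "ex:thing3", "ex:thing3", "ex:thing2", "ex:thing2"),
  predicate = c("label@en", "label@en", "type", "friendOf",
                "label@en", "uncleOf", "label@en", "label@de"),
  object    = c("Goofy", "Dippy Dawg", "ex:Dog", "ex:thing3",
                "Donald", "ex:thing2", "Huey", "Tick"),
  stringsAsFactors = FALSE
)

# Find the thing behind a name, then return every label attached to it.
labels_of <- function(name) {
  is_label <- grepl("^label@", triples$predicate)
  thing <- unique(triples$subject[is_label & triples$object == name])
  triples$object[is_label & triples$subject %in% thing]
}

labels_of("Tick")        # "Huey" "Tick"
labels_of("Dippy Dawg")  # "Goofy" "Dippy Dawg"
```

The lookup goes from a string to the thing and back to all of its names, which is the "things, not strings" idea in miniature.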

These fundamental principles for the organization of information are well reflected by semantic knowledge graphs. The same information could be stored as XML, or in a relational database, but it’s more efficient to use graph databases instead for the following reasons:

  • The way people think fits well with information that is modelled and stored when using graphs; little or no translation is necessary.
  • Graphs serve as a universal meta-language to link information from structured and unstructured data.
  • Graphs open up doors to a better aligned data management throughout larger organizations.
  • Graph-based semantic models can also be understood by subject matter experts, who are actually the experts in a certain domain.
  • The search capabilities provided by graphs let you find out unknown linkages or even non-obvious patterns to give you new insights into your data.
  • For semantic graph databases, there is a standardized query language called SPARQL that allows you to explore data.
  • In contrast to traditional ways to query databases where knowledge about the database schema/content is necessary, SPARQL allows you to ask “tell me what is there”.

Standards-based Semantics
Making the semantics of data and metadata explicit is even more powerful when based on standards. A framework for this purpose has evolved over the past 15 years at W3C, the World Wide Web Consortium. Initially designed to be used on the World Wide Web, many enterprises have been adopting this stack of standards for Enterprise Information Management. They now benefit from being able to integrate and link data from internal and external sources with relatively low costs.

At the base of all those standards, the Resource Description Framework (RDF) serves as a ‘lingua franca’ to express all kinds of facts that can involve virtually any kind of category or entity, and also all kinds of relations. RDF can be used to describe the semantics of unstructured text, XML documents, or even relational databases. The Simple Knowledge Organization System (SKOS) is based on RDF. SKOS is widely used to describe taxonomies and other types of controlled vocabularies. SPARQL can be used to traverse and make queries over graphs based on RDF or standard schemes like SKOS.

With SPARQL, far more complex queries can be executed than with most other database query languages. For instance, hierarchies can be traversed and aggregated recursively: a geographical taxonomy can then be used to find all documents containing places in a certain region although the region itself is not mentioned explicitly.
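To make the geographical example concrete, here is a small R sketch that imitates the idea of following a hierarchy to any depth. It is not SPARQL and not tied to any product; in SPARQL 1.1 this kind of traversal would typically be written with a property path over a relation such as `skos:broader`. The taxonomy, the documents and the helper names `descendants` and `docs_in` are all made up for illustration.

```r
# A toy geographical taxonomy as narrower -> broader edges (hypothetical data).
edges <- data.frame(
  narrower = c("Austria", "Germany", "Vienna", "Graz", "Berlin", "Japan", "Tokyo"),
  broader  = c("Europe", "Europe", "Austria", "Austria", "Germany", "Asia", "Japan"),
  stringsAsFactors = FALSE
)

# Documents annotated with the place they mention (hypothetical data).
docs <- data.frame(
  id    = c("doc1", "doc2", "doc3"),
  place = c("Vienna", "Berlin", "Tokyo"),
  stringsAsFactors = FALSE
)

# Collect everything below a concept, level by level, until nothing new appears.
descendants <- function(concept, edges) {
  found <- character(0)
  frontier <- concept
  while (length(frontier) > 0) {
    children <- setdiff(edges$narrower[edges$broader %in% frontier], found)
    found <- c(found, children)
    frontier <- children
  }
  found
}

# Documents about a region, even if the region itself is never mentioned.
docs_in <- function(region) {
  places <- c(region, descendants(region, edges))
  docs$id[docs$place %in% places]
}

docs_in("Europe")  # "doc1" "doc2"
```

Neither document mentions "Europe", yet both are found because Vienna and Berlin sit below it in the taxonomy.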

Standards-based semantics also helps to make use of already existing knowledge graphs. Many government organisations have made available high-quality taxonomies and semantic graphs by using semantic web standards. These can be picked up easily to extend them with own data and specific knowledge.

Semantic Knowledge Graphs will grow with your needs!
Standards-based semantics provides yet another advantage: it is becoming increasingly easy to hire skilled people who have worked with standards like RDF, SKOS or SPARQL before. Even so, experienced knowledge engineers and data scientists are a comparatively rare species. Therefore it’s crucial to grow graphs and modelling skills over time. Starting with SKOS and gradually extending an enterprise knowledge graph by introducing more schemes and by mapping to other vocabularies and datasets is a well established agile procedure model.

A graph-based semantic layer in enterprises can be expanded step-by-step, just like any other network. Analogous to a street network, start first with the main roads, introduce more and more connecting roads, classify streets, places, and intersections by a more and more distinguished classification system. It all comes down to an evolving semantic graph that will serve more and more as a map of your data, content and knowledge assets.

Semantic Knowledge Graphs and your Content Architecture
It’s a matter of fact that semantics serves as a kind of glue between unstructured and structured information and as a foundation layer for data integration efforts. But even for enterprises dealing mainly with documents and text-based assets, semantic knowledge graphs will do a great job.

Semantic graphs extend the functionality of a traditional search index. They don’t simply annotate documents and store occurrences of terms and phrases, they introduce concept-based indexing in contrast to term based approaches. Remember: semantics helps to identify the things behind the strings. The same applies to concept-based search over content repositories: documents get linked to the semantic layer, and therefore the knowledge graph can be used not only for typical retrieval but to classify, aggregate, filter, and traverse the content of documents.

PoolParty combines Machine Learning with Human Intelligence
Semantic knowledge graphs have the potential to innovate data and information management in any organisation. Besides questions around integrability, it is crucial to develop strategies to create and sustain the semantic layer efficiently.

Looking at the broad spectrum of semantic technologies that can be used for this endeavour, they range from manual to fully automated approaches. The promise to derive high-quality semantic graphs from documents fully automatically has not been fulfilled to date. On the other side, handcrafted semantics is error-prone, incomplete, and too expensive. The best solution often lies in a combination of different approaches. PoolParty combines Machine Learning with Human Intelligence: extensive corpus analysis and corpus learning support taxonomists, knowledge engineers and subject matter experts with the maintenance and quality assurance of semantic knowledge graphs and controlled vocabularies. As a result, enterprise knowledge graphs are more complete, up to date, and consistently used.

“An Enterprise without a Semantic Layer is like a Country without a Map.”


Read more…

Originally posted on Data Science Central

Original post published to DataScience+

In this post I will show how to collect data from a webpage and analyze or visualize it in R. For this task I will use the rvest package and will get the data from Wikipedia. I got the idea to write this post from Fisseha Berhane.
I will retrieve the prevalence of obesity in the United States from a Wikipedia page, then plot it on a map. Let’s begin by loading the required packages.


Download the data from Wikipedia.

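library(rvest)
obesity = read_html("")  # the page URL is elided in this copy of the post
obesity = obesity %>%
  html_nodes("table") %>%
  .[[1]] %>%  # assumption: first table on the page; this step is missing from the source
  html_table(fill = TRUE)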

The first line of code reads the page from Wikipedia, and the second statement extracts the table we are interested in and converts it into a data frame in R.
The head of our data:

State and District of Columbia Obese adults Overweight (incl. obese) adults
1 Alabama 30.1% 65.4%
2 Alaska 27.3% 64.5%
3 Arizona 23.3% 59.5%
4 Arkansas 28.1% 64.7%
5 California 23.1% 59.4%
6 Colorado 21.0% 55.0%
Obese children and adolescents Obesity rank
1 16.7% 3
2 11.1% 14
3 12.2% 40
4 16.4% 9
5 13.2% 41
6 9.9% 51

The data frame looks good; now we need to clean it to make it ready for plotting.

'data.frame': 51 obs. of 5 variables:
$ State and District of Columbia : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
$ Obese adults : chr "30.1%" "27.3%" "23.3%" "28.1%" ...
$ Overweight (incl. obese) adults: chr "65.4%" "64.5%" "59.5%" "64.7%" ...
$ Obese children and adolescents : chr "16.7%" "11.1%" "12.2%" "16.4%" ...
$ Obesity rank : int 3 14 40 9 41 51 49 43 22 39 ...

# remove the % and make the data numeric
for(i in 2:4){
  obesity[[i]] = gsub("%", "", obesity[[i]])
  obesity[[i]] = as.numeric(obesity[[i]])
}
# check data again
str(obesity)
'data.frame': 51 obs. of 5 variables:
$ State and District of Columbia : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
$ Obese adults : num 30.1 27.3 23.3 28.1 23.1 21 20.8 22.1 25.9 23.3 ...
$ Overweight (incl. obese) adults: num 65.4 64.5 59.5 64.7 59.4 55 58.7 55 63.9 60.8 ...
$ Obese children and adolescents : num 16.7 11.1 12.2 16.4 13.2 9.9 12.3 14.8 22.8 14.4 ...
$ Obesity rank : int 3 14 40 9 41 51 49 43 22 39 ...

Fix the names of variables by removing the spaces.

[1] "State and District of Columbia" "Obese adults"
[3] "Overweight (incl. obese) adults" "Obese children and adolescents"
[5] "Obesity rank"

names(obesity) = make.names(names(obesity))
[1] "State.and.District.of.Columbia" "Obese.adults"
[3] "Overweight..incl..obese..adults" "Obese.children.and.adolescents"
[5] "Obesity.rank"

Now, it's time to load the map data.

# load the map data (map_data() comes from ggplot2 and needs the maps package installed)
library(ggplot2)
states = map_data("state")
str(states)
'data.frame': 15537 obs. of 6 variables:
$ long : num -87.5 -87.5 -87.5 -87.5 -87.6 ...
$ lat : num 30.4 30.4 30.4 30.3 30.3 ...
$ group : num 1 1 1 1 1 1 1 1 1 1 ...
$ order : int 1 2 3 4 5 6 7 8 9 10 ...
$ region : chr "alabama" "alabama" "alabama" "alabama" ...
$ subregion: chr NA NA NA NA ...

To merge the two datasets (obesity and states) by region, we first need to create a new variable (region) in the obesity dataset.

# create a new variable name for state
obesity$region = tolower(obesity$State.and.District.of.Columbia)

Merge the datasets.

states = merge(states, obesity, by="region", all.x=TRUE)
str(states)
'data.frame': 15537 obs. of 11 variables:
$ region : chr "alabama" "alabama" "alabama" "alabama" ...
$ long : num -87.5 -87.5 -87.5 -87.5 -87.6 ...
$ lat : num 30.4 30.4 30.4 30.3 30.3 ...
$ group : num 1 1 1 1 1 1 1 1 1 1 ...
$ order : int 1 2 3 4 5 6 7 8 9 10 ...
$ subregion : chr NA NA NA NA ...
$ State.and.District.of.Columbia : chr "Alabama" "Alabama" "Alabama" "Alabama" ...
$ Obese.adults : num 30.1 30.1 30.1 30.1 30.1 30.1 30.1 30.1 30.1 30.1 ...
$ Overweight..incl..obese..adults: num 65.4 65.4 65.4 65.4 65.4 65.4 65.4 65.4 65.4 65.4 ...
$ Obese.children.and.adolescents : num 16.7 16.7 16.7 16.7 16.7 16.7 16.7 16.7 16.7 16.7 ...
$ Obesity.rank : int 3 3 3 3 3 3 3 3 3 3 ...
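The behaviour of `merge()` with `all.x = TRUE` can be seen on a toy example. The sketch below uses a tiny stand-in for the map data, a made-up region name, and three rates taken from the table above. Re-sorting by `order` afterwards is a common precaution when drawing polygons, because `merge()` sorts rows by the key; the original post does not show that step, so treat it as a suggestion.

```r
# Tiny stand-in for the map data; "no such state" is a made-up region.
map_pts <- data.frame(
  region = c("arizona", "alabama", "alabama", "no such state"),
  order  = c(3L, 1L, 2L, 4L),
  stringsAsFactors = FALSE
)

# Three rows of the obesity table shown earlier.
rates <- data.frame(
  State        = c("Alabama", "Alaska", "Arizona"),
  Obese.adults = c(30.1, 27.3, 23.3),
  stringsAsFactors = FALSE
)
rates$region <- tolower(rates$State)

# Keep every map row (all.x = TRUE); unmatched regions get NA.
merged <- merge(map_pts, rates, by = "region", all.x = TRUE)

# Restore the drawing order of the polygon points.
merged <- merged[order(merged$order), ]
merged
```

Every map row survives the merge, the unmatched region gets `NA` (which the plots below render with `na.value`), and Alaska is dropped because it has no matching row in the map data.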

Plot the data

Finally we will plot the prevalence of obesity in adults.

# adults
library(ggplot2)
library(scales)  # for pretty_breaks()
ggplot(states, aes(x = long, y = lat, group = group, fill = Obese.adults)) +
  geom_polygon(color = "white") +
  scale_fill_gradient(name = "Percent", low = "#feceda", high = "#c81f49", guide = "colorbar", na.value="black", breaks = pretty_breaks(n = 5)) +
  labs(title="Prevalence of Obesity in Adults")
# (the final layer of the original call is missing from this copy)

Here is the plot for adults:
Similarly, we can plot the prevalence of obesity in children.

# children
ggplot(states, aes(x = long, y = lat, group = group, fill = Obese.children.and.adolescents)) +
  geom_polygon(color = "white") +
  scale_fill_gradient(name = "Percent", low = "#feceda", high = "#c81f49", guide = "colorbar", na.value="black", breaks = pretty_breaks(n = 5)) +
  labs(title="Prevalence of Obesity in Children")
# (the final layer of the original call is missing from this copy)

Here is the plot for children:
If you would like to show the state names on the map, use the code below to create a new dataset.

library(dplyr)
statenames = states %>%
  group_by(region) %>%
  summarise(
    long = mean(range(long)),
    lat = mean(range(lat)),
    group = mean(group),
    Obese.adults = mean(Obese.adults),
    Obese.children.and.adolescents = mean(Obese.children.and.adolescents)
  )

Then add this layer to the ggplot code above:

geom_text(data=statenames, aes(x = long, y = lat, label = region), size=3)
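
The choice of mean(range(long)) rather than mean(long) is worth a quick illustration: it gives the midpoint of the bounding box, so the label is not pulled toward stretches of border that happen to have many vertices. A minimal base R sketch, using a hypothetical toy_shape rather than real map data:

```r
# Hypothetical outline: a 10 x 4 box with extra vertices crowded near the left edge
toy_shape <- data.frame(
  long = c(0, 0.1, 0.2, 0.3, 10, 10, 0),
  lat  = c(0, 0,   0,   0,   0,  4,  4)
)

# Midpoint of the bounding box, as in the statenames code
label_long <- mean(range(toy_shape$long))
label_lat  <- mean(range(toy_shape$lat))

# Plain vertex average, pulled toward the crowded corner
vertex_long <- mean(toy_shape$long)

print(c(label_long = label_long, label_lat = label_lat))
print(vertex_long)
```

For irregular shapes neither value is guaranteed to fall inside the polygon, so a few labels may still need manual adjustment.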

That's all. I hope you learned something useful today.


Guest blog post by Klodian

Original post is published at DataScience+

Recently, I became interested in scraping data from web pages, such as Wikipedia, and visualizing it with R. As in my previous post, I use the rvest package to get the data from the web page and the ggplot2 package to visualize it.

In this post, I will map the life expectancy of White and African-American people in the US.

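Load the required packages.

library(rvest)    # read_html(), html_nodes()
library(ggplot2)  # ggplot(), map_data()
library(dplyr)    # %>%, mutate()
library(scales)   # pretty_breaks()
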
Import the data from Wikipedia.

le = read_html("")
le = le %>%
html_nodes("table") %>%

Now I have to clean the data. The comments below explain what each line does.

# select only the columns with data
le = le[c(1:8)]
# take the column names from the 3rd row
names(le) = le[3,]
# delete the rows and columns I am not interested in
le = le[-c(1:3), ]
le = le[, -c(5:7)]
# rename the 4th and 5th columns
names(le)[c(4,5)] = c("le_black", "le_white")
# convert the variables to numeric
le = le %>%
  mutate(
    le_black = as.numeric(le_black),
    le_white = as.numeric(le_white))
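
Scraped tables arrive as character columns, and any cell without a usable number turns into NA when converted. A minimal sketch with made-up values (toy_le is hypothetical, and base R is used instead of dplyr only to keep the example self-contained):

```r
# Hypothetical scraped columns: numbers stored as text, one cell with no data
toy_le <- data.frame(
  State    = c("Alabama", "Alaska", "Idaho"),
  le_black = c("72.9", "77.0", "n/a"),
  le_white = c("76.0", "79.0", "79.4"),
  stringsAsFactors = FALSE
)

# as.numeric() warns on text it cannot parse and returns NA for that cell
toy_le$le_black <- suppressWarnings(as.numeric(toy_le$le_black))
toy_le$le_white <- as.numeric(toy_le$le_white)

# Any value computed from an NA is itself NA
toy_le$le_diff <- toy_le$le_white - toy_le$le_black

print(toy_le)
```

These NA values are the ones that na.value colors grey in the maps.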

Since life expectancy differs between White and African-American people, I will calculate the difference and map it as well.

le = le %>% mutate(le_diff = (le_white - le_black))

I will load the map data and merge the datasets together.

states = map_data("state")
# create a new variable name for state
le$region = tolower(le$State)
# merge the datasets
states = merge(states, le, by="region", all.x=T)

Now it's time to make the plot. First I will plot the life expectancy of African Americans in the US. For a few states we don't have data, so I will color them grey.

# Life expectancy in African American
ggplot(states, aes(x = long, y = lat, group = group, fill = le_black)) +
  geom_polygon(color = "white") +
  scale_fill_gradient(name = "Years", low = "#ffe8ee", high = "#c81f49", guide = "colorbar", na.value="#eeeeee", breaks = pretty_breaks(n = 5)) +
  labs(title="Life expectancy in African American")

Here is the plot:

The code below is for White people in the US.

# Life expectancy in White American
ggplot(states, aes(x = long, y = lat, group = group, fill = le_white)) +
  geom_polygon(color = "white") +
  scale_fill_gradient(name = "Years", low = "#ffe8ee", high = "#c81f49", guide = "colorbar", na.value="Gray", breaks = pretty_breaks(n = 5)) +
  labs(title="Life expectancy in White")

Here is the plot:

Finally, I will map the difference between White and African-American people in the US.

# Differences in Life expectancy between White and African American
ggplot(states, aes(x = long, y = lat, group = group, fill = le_diff)) +
  geom_polygon(color = "white") +
  scale_fill_gradient(name = "Years", low = "#ffe8ee", high = "#c81f49", guide = "colorbar", na.value="#eeeeee", breaks = pretty_breaks(n = 5)) +
  labs(title="Differences in Life Expectancy between \nWhite and African Americans by States in US")

Here is the plot:

On my previous post I got a comment asking for a pop-up effect when hovering over the states. This is a simple task, as Andrea explained in his comment. You need to install the plotly package, store the ggplot in an object, and then call ggplotly(map_plot) to plot it.

library(plotly)

# store the plot in an object instead of printing it
map_plot = ggplot(states, aes(x = long, y = lat, group = group, fill = le_black)) +
  geom_polygon(color = "white") +
  scale_fill_gradient(name = "Years", low = "#ffe8ee", high = "#c81f49", guide = "colorbar", na.value="#eeeeee", breaks = pretty_breaks(n = 5)) +
  labs(title="Life expectancy in African American")

# convert it to an interactive plot with hover pop-ups
ggplotly(map_plot)

Here is the plot:

That's all! Leave a comment below if you have any questions.

Original post: Map the Life Expectancy in United States with data from Wikipedia


13 Great Data Science Infographics

Originally posted on Data Science Central

Most of these infographics are tutorials covering various topics in big data, machine learning, visualization, data science, Hadoop, R or Python, typically intended for beginners. Some are cheat sheets and can be nice summaries for professionals with years of experience. Some, popular a while back (you will find one example here) were designed as periodic tables.

For Geeks 

For Business People

Infographics Repositories


Twitter Analytics using Tweepsmap

Guest blog post by Salman Khan

This morning I saw #tweepsmap on my Twitter feed and decided to check it out. Tweepsmap is a neat tool that can analyze any Twitter account from a social network perspective. It can create interactive maps showing where the followers of a Twitter account reside, segment followers, and even show who unfollowed you!

Here is my Followers map generated by country.

You can create the followers map based on city and state as well.

Tweepsmap also provides demographic information such as languages, occupation and gender but it relies on the twitter user having entered this information in the twitter profile.

There is also a hashtag and keyword analyzer that reports on the most prolific tweeters, locations of tweets, tweets vs. retweets and so on. I used their free report, which is limited to a maximum of 100 tweets, to analyze the trending hashtag #BeautyAndTheBeast. For some reason, the #BeautyAndTheBeast hashtag is really popular in Brazil: out of the 100 tweets, 26 were from Brazil and 20 from the USA. You can see that 3 of the 5 top influencers with the most followers are tweeting in Portuguese. Other visualizations included the tweet vs. retweet numbers and the distribution of the tweeters' reach. I was even able to make the report public, so you can check it out here. Remember that it only analyzes 100 tweets, so don't draw any conclusions from it!


If you are doing research on social media, or are a business that wants to learn more about competitors and customers, Tweepsmap can analyze specific Twitter accounts as well. Of course, we all know there is no such thing as a free lunch, so this is a paid feature!

From what I saw by tinkering with the pricing calculator on their page, the analysis of a twitter account with more than 2.5M followers will cost a flat fee of $5K. I tried a few twitter accounts to see how much each would cost based on number of followers and found that the cost per follower was $0.002.  So if you wanted to get twitter data on Hans Rosling, it would cost you $642 as he has 320,956 followers (642/320,956 = 0.002).
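
The pricing observed above can be written as a small function. This is only a sketch of the figures reported in this post (about $0.002 per follower, a flat $5K above 2.5M followers); estimate_cost and its parameter names are hypothetical, and Tweepsmap's actual pricing rules may differ.

```r
# Rough cost estimate based on the numbers observed in the pricing calculator.
# rate, cap_followers and flat_fee are assumptions taken from this post.
estimate_cost <- function(followers, rate = 0.002,
                          cap_followers = 2.5e6, flat_fee = 5000) {
  ifelse(followers > cap_followers, flat_fee, followers * rate)
}

# Hans Rosling example from the text: 320,956 followers
rosling_cost <- round(estimate_cost(320956))
print(rosling_cost)

# An account above the 2.5M threshold pays the flat fee
big_account_cost <- estimate_cost(3e6)
print(big_account_cost)
```

The two observations are consistent with each other: 2.5M followers at $0.002 each comes to exactly $5,000, so the flat fee acts as a cap on the per-follower price.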


Overall, this looks like a neat tool for getting started with Twitter data analysis and for using that information to maximize the returns on your tweets. I have only mentioned a few of their tools above; they have other features, such as Best Time to Tweet, which analyzes your audience, Twitter history, time zones and so on to predict when your tweets will get the most attention. Check out their website for more info here.


Originally posted on Data Science Central

Written by sought-after speaker, designer, and researcher Stephanie D. H. Evergreen, Effective Data Visualization shows readers how to create Excel charts and graphs that best communicate data findings. This comprehensive how-to guide functions as a set of blueprints, supported by research and the author's extensive experience with clients in industries all over the world, for conveying data in an impactful way. Delivered in Evergreen's humorous and approachable style, the book covers the spectrum of graph types available beyond the default options, how to determine which one most appropriately fits specific data stories, and easy steps for making the chosen graph in Excel.

The book is available here.
