
Visualize Your Data Using

Guest blog post by Vozag

Creating interactive visualizations of your data for the web is a cakewalk: all you need to do is import your spreadsheet and start generating interactive visualizations.

You need to sign in to start creating visualizations. If you don't have an account, signing up is easy: all you need is a name, an email ID and a password, and the account is created immediately.

Add your data

Importing data is very easy. You can import any spreadsheet by clicking the 'Import spreadsheet' link on the right side of the home screen. There is an option for importing either CSV or Excel files. You can also choose to import data directly from a Google spreadsheet by pasting its URL.


Once you choose a file to import and click open, you are taken to a screen that shows a number of rows from your data. Note that each row in the spreadsheet is considered a page. You can set which column is to be used as the title for each page and how each column should be interpreted. You may also choose to ignore some columns, depending on your requirements.

Once you are done, you can finish importing the data by clicking the ‘Start import’ button on the top right of the page. The time taken to import the data depends upon the number of rows and the type of data you are importing.
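The import flow described above can be sketched in a few lines of Python. The tool itself is unnamed here, so this is a generic stand-in using the standard csv module: each row becomes a "page", one chosen column supplies the title, and ignored columns are dropped.

```python
import csv
import io

def import_spreadsheet(csv_text, title_column, ignore_columns=()):
    """Turn each spreadsheet row into a 'page': a title plus its remaining fields."""
    reader = csv.DictReader(io.StringIO(csv_text))
    pages = []
    for row in reader:
        title = row[title_column]
        fields = {col: val for col, val in row.items()
                  if col != title_column and col not in ignore_columns}
        pages.append({"title": title, "fields": fields})
    return pages

data = "county,home_value,notes\nKings,500000,x\nCook,300000,y\n"
pages = import_spreadsheet(data, title_column="county", ignore_columns=("notes",))
print(pages[0])  # {'title': 'Kings', 'fields': {'home_value': '500000'}}
```

The real tool also infers how each column should be interpreted (number, date, address, and so on); that type-detection step is omitted here.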



Explore Your Collection

Once the data is imported, it's shown in a dashboard. You can start exploring your data by clicking the 'Explore your collection' link, which is just under the collection name.

By default, the data is shown as a table. You can change the visualization by clicking the drop down button available on the right side of the explore page.


Apart from a table format, you have the option of visualizing your data as a list, grid, pie chart, bar chart, scatter plot, map and a few other forms.

Generate and save visualizations


Generating map visualizations is very easy, given that the tool can convert addresses into latitude-longitude pairs and plot them on the map directly. If you have worked on generating map-related visualizations before, you will appreciate how much pain this saves. But proper care should be taken to give the full address, or the generated plots will be erratic.
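The "full address" caveat can be enforced before plotting. The sketch below is purely illustrative (the tool's own geocoder is not exposed, and the addresses are made up); it uses a naive comma-count heuristic to drop addresses too vague to geocode reliably.

```python
def is_full_address(address):
    """A full address should carry at least street, city and country parts;
    anything shorter risks being geocoded to the wrong place entirely."""
    parts = [p.strip() for p in address.split(",") if p.strip()]
    return len(parts) >= 3

rows = [
    "1600 Amphitheatre Pkwy, Mountain View, USA",
    "Springfield",                      # ambiguous: dozens of Springfields exist
    "10 Downing St, London, UK",
]
plottable = [a for a in rows if is_full_address(a)]
print(plottable)  # keeps only the two complete addresses
```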

Get started by choosing the map visualization in the drop-down. You can choose the columns you want to show in the pop-up for a location on the map; choose the column of numeric data if you are plotting a heat map; and choose the column on which you want to sort the data from the options given. Most importantly, you have to choose the column which holds the location information, and the visualization will be generated accordingly. You can choose to add the visualization to any of your pages.

Above is a heat map showing home values in the top 200 counties in the USA.

Share your work

Once your visualizations are generated, you can choose to email them, share them on social networking platforms like Facebook, Twitter and LinkedIn, or embed them in any of your web pages by clicking the 'Share and embed' button in the top right corner of the page.

Interactive visualization of the map is given here.  

Read more…

Guest blog post by Jean Villedieu

The European Union is giving free access to detailed information about public purchasing contracts. That data describes which European institutions are spending money, for what and who benefits from it. We are going to explore the network of public institutions and suppliers and look for interesting patterns with graph visualization.

Public spending in Europe is a €1.3 trillion business

Every year, more than €1.3 trillion in contracts are awarded by public entities in Europe. In an effort to make these public contracts more transparent, the European Union has decided to make the tender announcements public. The information can be found online through the EU's open data portal. OpenTED, a non-profit organization, has gone a step further and made the data available in CSV format.

Public contracts are complex, though. Each involves at least one commercial entity which is awarded a contract by a public authority. The public entity may be acting as a "delegate" for another public entity. The contract can be disputed in a certain jurisdiction of appeal.

We have multiple entities and relationships. What this means is that the tenders data describe a graph or network. We are going to explore the tenders graph with Neo4j and Linkurious.

Modeling public contracts as a graph

We will focus on the 2015 tenders. There are 73,269 tenders in one single CSV file with 45 columns.


We decided to model the graph in the following way:

The tender data model in Neo4j.

The graph model above highlights the relationships between the contracts, appeal bodies, operators, delegates and authorities in our data.

To put it in a Neo4j database, we wrote a script that you can see here.

This script takes the 2015 tenders data and turns it into a Neo4j graph database with 161,541 nodes and 536,936 edges. Now that it's in Neo4j, we can search, explore and visualize it with Linkurious. It's time to start asking the data questions!
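The linked script is the authoritative version; as a rough illustration of the same idea, the sketch below (with made-up rows and assumed column names) derives one node per distinct authority, operator and contract, and one edge per awarding or winning relationship, exactly the transformation the real import performs at scale.

```python
import csv
import io

# Tiny stand-in for the 73,269-row tenders CSV; column names are assumptions.
csv_text = """contract_id,authority,operator
C1,Ministry of Defence,Babcock International Group Plc
C2,Ministry of Defence,KPMG
C3,GCS Uniha,KPMG
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))

# One node per distinct entity of each label.
nodes = ({("AUTHORITY", r["authority"]) for r in rows}
         | {("OPERATOR", r["operator"]) for r in rows}
         | {("CONTRACT", r["contract_id"]) for r in rows})

# One edge per relationship in each row, mirroring the graph model above.
edges = ({("IS_AUTHORITY_OF", r["authority"], r["contract_id"]) for r in rows}
         | {("IS_OPERATOR_OF", r["operator"], r["contract_id"]) for r in rows})

print(len(nodes), len(edges))  # 7 nodes, 6 edges for this toy sample
```

Because entities are deduplicated into sets, an authority that awards many contracts becomes one node with many edges, which is what makes the hub patterns in the later visualizations visible.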

The biggest tenders and the authorities and companies which are involved

As a first step, let's try to identify the big public contracts which happened in Europe in 2015 and what organizations they involved. In order to get that answer, we'll use Cypher (the Neo4j query language) within Linkurious.

Here’s how to find the 10 biggest contracts and the public authorities and commercial providers they involved.

// The top 10 biggest contracts and who they involve
MATCH (a:AUTHORITY)-[:IS_AUTHORITY_OF]->(b:CONTRACT)<-[:IS_OPERATOR_OF]-(c:OPERATOR)
WHERE b.contract_initial_value_cost_eur <> 'null'
RETURN a, c, b
ORDER BY b.contract_initial_value_cost_eur DESC
LIMIT 10

The result is the following graph.

biggest public contracts

We can see, for example, that the Ministry of Defence has awarded a large contract to Babcock International Group Plc.

Missed connections

The graph structure of the data also allows us to spot missed opportunities. Take KPMG, for example. It is one of the biggest recipients of public contracts within our dataset. What other contracts could the company have been awarded?

To answer that question, we can identify which of KPMG’s customers awarded contracts to its competitors.

Let’s identify the biggest missed opportunities of KPMG:

// KPMG's biggest missed opportunities
MATCH (a:OPERATOR {official_name: 'KPMG'})-[:IS_OPERATOR_OF]->(b:CONTRACT)<-[:IS_AUTHORITY_OF]-(customer:AUTHORITY)-[:IS_AUTHORITY_OF]->(lost_opportunity:CONTRACT)<-[:IS_OPERATOR_OF]-(competitor:OPERATOR)
WHERE NOT (a)-[:IS_OPERATOR_OF]->(lost_opportunity)
RETURN a, b, customer, lost_opportunity, competitor

We can visualize the result in Linkurious:

KPMG's network.

KPMG is highlighted in red. It is connected to 10 contracts by 9 customers. These customers have awarded 246 other contracts to 181 firms.

This visualization could be used by KPMG to identify its "unfaithful" customers and its competitors. Specifically, we may want to filter the visualization to focus on contracts for services similar to the ones KPMG offers. To do that, we will use the CPV code, a procurement taxonomy of products and services.
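A minimal sketch of that CPV-filtering idea in Python, with made-up contract rows and an assumed four-digit-prefix notion of "similar" (the real filtering happens inside Linkurious):

```python
contracts = [
    {"id": "C1", "cpv": "79212000"},   # auditing services
    {"id": "C2", "cpv": "79210000"},   # accounting and auditing services
    {"id": "C3", "cpv": "48000000"},   # software packages
]

# CPV codes of contracts KPMG has already won (illustrative values).
kpmg_cpvs = {"79212000", "79210000"}

# Keep contracts whose CPV shares its first four digits with one KPMG won,
# i.e. contracts for broadly similar services.
similar = [c for c in contracts
           if any(c["cpv"][:4] == k[:4] for k in kpmg_cpvs)]
print([c["id"] for c in similar])  # ['C1', 'C2']
```

The prefix length is a tunable assumption: CPV is hierarchical, so a shorter prefix widens the definition of "similar services" and a longer one narrows it.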

Here is a visualization filtered to only display contracts with similar CPV codes to KPMG's contracts:

Filtering on contracts KPMG could have won.

If we zoom in, we can see, for example, that GCS Uniha, one of KPMG's customers, has also awarded contracts to some of KPMG's competitors (Ernst and Young, PwC and Deloitte).

GCS Uniha has also awarded contracts to Ernst and Young, PwC and Deloitte.

Exploring the French IT sector

Finally, we can use Linkurious to visualize a particular economic sector. Let’s focus for example on IT spending in France. Who are the biggest spenders in that sector? Which companies are capturing these contracts? Finally what are the relationships between all these organizations?

Using a Cypher query, we can identify the customers and suppliers linked to IT contracts (which all have a CPV code starting with “48”):

//The French public IT market
MATCH (a:AUTHORITY)-[:IS_AUTHORITY_OF]->(b:CONTRACT)<-[:IS_OPERATOR_OF]-(c:OPERATOR)
WHERE b.contract_cpv_code STARTS WITH "48" AND a.country = 'FR' // country property name assumed
RETURN a,b,c

We can visualize the result directly in Linkurious.

Public IT contracts in France in 2015.

If we zoom in, we can see that Conseil Général des Bouches du Rhône has awarded 9 IT contracts in 2015, mostly to Dalkia and Idex énérgies.

Conseil Général des Bouches du Rhône has awarded 9 IT contracts in 2015.

We have dived into €1.3 trillion of public spending using Neo4j and Linkurious. We were able to identify some key actors of the public contracts market and map interesting ecosystems. Want to explore and understand your graph data? Simply try the demo of Linkurious!

Read more…

Space is limited.
Reserve your Webinar seat now

Web and graphic design principles are tremendously useful for creating beautiful, effective dashboards. In this latest Data Science Central Webinar event, we will consider how common design mistakes can diminish visual effectiveness. You will learn how placement, weight, font choice, and practical graphic design techniques can maximize the impact of your visualizations.

Speaker: Dave Powell, Solution Architect -- Tableau
Hosted by: Bill Vorhies, Editorial Director -- Data Science Central

Again, Space is limited so please register early:
Reserve your Webinar seat now

After registering you will receive a confirmation email containing information about joining the Webinar.
Read more…

The current state of machine intelligence 2.0

Guest blog post by Laetitia Van Cauwenberge

Interesting O'Reilly article, although focused on the business implications, not the technical aspects. I really liked the infographic, as well as these statements:

  • Startups are considering beefing up funding earlier than they would have, to fight inevitable legal battles and face regulatory hurdles sooner.
  • Startups are making a global arbitrage (e.g., health care companies going to market in emerging markets, drone companies experimenting in the least regulated countries).
  • The “fly under the radar” strategy. Some startups are being very careful to stay on the safest side of the grey area, keep a low profile, and avoid the regulatory discussion as long as possible.

This is a long article with the following sections:

  • Reflections on the landscape
  • Achieving autonomy
  • The new (in)human touch
  • 50 shades of grey markets 
  • What’s your (business) problem?
  • The great verticalization
  • Your money is nice, but tell me more about your data

Note that we are ourselves working on data science 2.0. Some of our predictions can be found here. And for more on machine intelligence (not sure how it is different from machine learning), click on the following links:

To read the full article, click here. Below is the infographic from the original article.

DSC Resources

Additional Reading

Follow us on Twitter: @DataScienceCtrl | @AnalyticBridge

Read more…

Comparing Data Science and Analytics

You may think that all big data experts are created equal, but nothing could be further from the truth. Yet the terms "data scientist" and "business analyst" are often used interchangeably. It's a common and confusing use of terminology, which is why [email protected], a master's in business analytics program, created this infographic to help bring further clarity to the two roles.

Both business analysts and data scientists are experts in the use of big data, but they have different types of educational backgrounds, usually work in different types of settings, and use their skills and knowledge in completely different ways.

Reflective of the increasing need to extract value from the mountain of big data at our fingertips, business analysts are in much higher demand—with a predicted job growth of 27 percent over the next decade. They dig into a variety of sources to pull data for analysis of past, present and future business performance, and then take those results and use the most effective tools to translate them to business leaders. They typically have educational backgrounds in specialties like business and humanities.

In contrast, data scientists are digital builders. They have strong educational backgrounds in computer science, mathematics and technology, and use statistical programming to build the framework for gathering and using the data, creating and implementing algorithms to do it. Such algorithms help with decision-making, data management, and the creation of visualizations to help explain the data they gather.

To find out more about who will fit the bill for your organization, dig into the infographic to make sure the big data expert you’re hiring is the right one to meet your needs.

Originally posted on Data Science Central

Read more…

Two great visualizations about data science

Guest blog post by Laetitia Van Cauwenberge

The first one is about the difference between Data Science, Data Analysis, Big Data, Data Analytics, and Data Mining:

According to a tweet, the source is a website that is very interesting in its own right, though I could not find the article in question. Anyway, I love the picture above; please share it if you agree with it. If someone knows the link to the original article, please post it in the comments section.

And now another nice picture about the history of big data and data science:

Here I have the reference for the picture: click here


Read more…

Main Trends for IT in 2016 (infographic)

Guest blog and infographic from Glorium Technologies

A Spiceworks report identifies four main IT trends for the upcoming year, 2016.

  1. Companies' revenue is set to grow while IT budgets are to remain the same.
  2. As a consequence, CIOs will have to do more with less.
  3. Even though security concerns are going to increase, security expenses are to stay flat.
  4. The leading priority of the 2016 IT budget is product end of life.


On average, each company plans to increase IT spending by at most $2,000 (comparing 2015 and 2016, worldwide). Among respondents, 42% claim that their budget remains the same, 38% state that it is to increase, while 10% are bound to decrease it. The reason for this outcome is a willingness to keep costs low. IT staff numbers are not set to increase, either: 59% report no changes, 34% plan to increase, and 4% plan to decrease.


The main priorities of the 2016 IT budget are hardware expenses (37%), software expenses (31%), cloud-based projects (14%) and managed services (13%).


Surprisingly, laptops are not set to take first place in 2016; desktops are. Here is how hardware expenses are to be distributed: 21% desktops, 19% servers, 16% laptops, 10% networking, 6% external storage, 6% mobile and tablets.


Software expenses are divided more evenly: 15% goes to virtualization, OS and productivity. 10% goes to CRM, backup and database. 9% goes to security.
The main motto of expenses: “If it ain’t broken – don’t fix it”.


Here are the main reasons for companies' investing in IT: end of life, growth or additional needs, upgrades or refresh cycles, end user need, project need, budget availability, application compatibility, and new technologies or features.


Responses to the question "Is your security at a decent level?" are rather surprising: 61% do not conduct security audits, 59% believe that their investments in security are not adequate, for 51% security is not a 2016 priority, and 48% regard their business processes as not adequately protected.




  1. Companies' revenue is going to increase, while IT budgets are not going to change.
  2. CIOs will have to solve strategic issues for the future with a budget from the past.
  3. Product end of life is the main driver of IT investments.
  4. Even though companies understand that they do not invest enough in security, security budgets are to stay rather low.


Infographic preparation was based upon Spiceworks annual report on IT budgets and tech trends.

Originally posted on Data Science Central

Read more…

The Amazing Ways Uber Is Using Big Data

Guest blog post by Bernard Marr

Uber is a smartphone-app based taxi booking service which connects users who need to get somewhere with drivers willing to give them a ride. The service has been hugely controversial, due to regular taxi drivers claiming that it is destroying their livelihoods, and concerns over the lack of regulation of the company’s drivers.

Source for picture: Mapping a city’s flow using Uber data

This hasn’t stopped it from also being hugely successful – since being launched to purely serve San Francisco in 2009, the service has been expanded to many major cities on every continent except for Antarctica.

The business is rooted firmly in Big Data and leveraging this data in a more effective way than traditional taxi firms have managed has played a huge part in its success.

Uber’s entire business model is based on the very Big Data principle of crowd sourcing. Anyone with a car who is willing to help someone get to where they want to go can offer to help get them there.  

Uber holds a vast database of drivers in all of the cities it covers, so when a passenger asks for a ride, it can instantly match them with the most suitable drivers.
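Uber's matching engine is proprietary, but its core step can be sketched: compute the great-circle distance from the passenger to each available driver and pick the nearest. The coordinates and driver names below are invented for illustration.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two GPS points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

# Last-known GPS positions of available drivers (made-up data).
drivers = {"anna": (37.78, -122.42), "bob": (37.70, -122.48), "carol": (37.77, -122.41)}
passenger = (37.775, -122.418)

# Pick the driver with the smallest distance to the passenger.
nearest = min(drivers, key=lambda d: haversine_km(*passenger, *drivers[d]))
print(nearest)  # 'anna'
```

A real dispatcher would also weigh estimated time of arrival, driver rating and heading, but distance minimisation is the backbone of the idea.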

Fares are calculated automatically, using GPS, street data and the company’s own algorithms which make adjustments based on the time that the journey is likely to take. This is a crucial difference from regular taxi services because customers are charged for the time the journey takes, not the distance covered.

Surge pricing

These algorithms monitor traffic conditions and journey times in real time, meaning prices can be adjusted as demand for rides changes and as traffic conditions make journeys likely to take longer. This encourages more drivers to get behind the wheel when they are needed – and to stay at home when demand is low. The company has applied for a patent on this method of Big Data-informed pricing, which it calls "surge pricing".

This algorithm-based approach with little human oversight has occasionally caused problems – it was reported that fares were pushed up sevenfold by traffic conditions in New York on New Year’s Eve 2011, with a journey of one mile rising in price from $27 to $135 over the course of the night.

This is an implementation of “dynamic pricing” – similar to that used by hotel chains and airlines to adjust price to meet demand – although rather than simply increasing prices at weekends or during public holidays, it uses predictive modelling to estimate demand in real time.  
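Uber's actual pricing model is patented and far more sophisticated; the toy sketch below only illustrates the two ideas described above: a time-based fare and a demand/supply multiplier with a floor and a cap. All rates and numbers are invented.

```python
def surge_multiplier(ride_requests, available_drivers, cap=5.0):
    """Toy dynamic pricing: the price scales with the demand/supply ratio,
    never dropping below 1x and capped to avoid runaway fares."""
    if available_drivers == 0:
        return cap
    ratio = ride_requests / available_drivers
    return min(cap, max(1.0, ratio))

def fare(minutes, base=2.50, per_minute=0.35, multiplier=1.0):
    """Time-based fare: as described above, riders pay for time, not distance."""
    return round((base + per_minute * minutes) * multiplier, 2)

quiet_night = fare(20, multiplier=surge_multiplier(10, 40))        # multiplier 1.0
new_years_eve = fare(20, multiplier=surge_multiplier(200, 40))     # capped at 5.0
print(quiet_night, new_years_eve)  # 9.5 47.5
```

Even this toy version reproduces the New Year's Eve effect from the text: the same journey costs five times more when requests massively outnumber drivers.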


Changing the way we book taxis is just a part of the grand plan though. Uber CEO Travis Kalanick has claimed that the service will also cut the number of private, owner-operated automobiles on the roads of the world’s most congested cities. In an interview last year he said that he thinks the car-pooling UberPool service will cut the traffic on London’s streets by a third.

UberPool allows users to find others near them who, according to Uber's data, often make similar journeys at similar times, and to offer to share a ride with them. According to their blog, introducing this service became a no-brainer when their data told them the "vast majority of [Uber trips in New York] have a look-a-like trip – a trip that starts near, ends near, and is happening around the same time as another trip".

Other initiatives either trialled or due to launch in the future include UberChopper, offering helicopter rides to the wealthy, UberFresh for grocery deliveries and Uber Rush, a package courier service.

Rating systems

The service also relies on a detailed rating system – users can rate drivers, and vice versa – to build up trust and allow both parties to make informed decisions about who they want to share a car with.

Drivers in particular have to be very conscious of keeping their standards high – a leaked internal document showed that those whose score falls below a certain threshold face being “fired” and not offered any more work.

They have another metric to worry about, too – their “acceptance rate”. This is the number of jobs they accept versus those they decline. Drivers were told they should aim to keep this above 80%, in order to provide a consistently available service to passengers.
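The acceptance-rate metric is simple enough to state precisely; the numbers below are illustrative, and the 80% threshold is the figure reported in the text.

```python
def acceptance_rate(accepted, declined):
    """Fraction of offered jobs the driver accepted."""
    offered = accepted + declined
    return accepted / offered if offered else 0.0

def meets_target(accepted, declined, target=0.80):
    """Drivers were reportedly asked to keep their acceptance rate above 80%."""
    return acceptance_rate(accepted, declined) >= target

print(meets_target(90, 10))  # True: 90% acceptance
print(meets_target(30, 10))  # False: 75% acceptance
```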

Uber’s response to protests over its service by traditional taxi drivers has been to attempt to co-opt them, by adding a new category to its fleet. UberTaxi - meaning you will be picked up by a licensed taxi driver in a registered private hire vehicle - joined UberX (ordinary cars for ordinary journeys), UberSUV (large cars for up to 6 passengers) and UberLux (high end vehicles) as standard options.

Regulatory pressure and controversies

It will still have to overcome legal hurdles – the service is currently banned in a handful of jurisdictions including Brussels and parts of India, and is receiving intense scrutiny in many other parts of the world. Several court cases are underway in the US regarding the company’s compliance with regulatory procedures.  

Another criticism is that because credit cards are the only payment option, the service is not accessible to a large proportion of the population in less developed nations where the company has focused its growth.

But given its popularity wherever it has launched around the world, there is a huge financial incentive for the company to press ahead with its plans for revolutionising private travel.

 If regulatory pressures do not kill it, then it could revolutionise the way we travel around our crowded cities – there are certainly environmental as well as economic reasons why this would be a good thing.

Uber is not alone – it has competitors offering similar services on a (so far) smaller scale, such as Lyft, Sidecar and Haxi. If a deregulated private hire market emerges through Uber's innovation, it will be hugely valuable, and competition among these upstarts will be fierce. We can expect the winners to be those who make the best use of the data available to them, to improve the service they offer to their customers.


Case study - how Uber uses big data - a nice, in-depth case study of how they have based their entire business model on big data, with some practical examples and some mention of the technology used.

I hope you found this post interesting. I am always keen to hear your views on the topic and invite you to comment with any thoughts you might have.

About: Bernard Marr is a globally recognized expert in analytics and big data. He helps companies manage, measure, analyze and improve performance using data.

His new book is Big Data: Using Smart Big Data, Analytics and Metrics To Make Better Decisions and Improve Performance. You can read a free sample chapter here.


Read more…

The 3Vs that define Big Data

Guest blog post by Diya Soubra

As I studied the subject, the following three terms stood out in relation to Big Data.

Variety, Velocity and Volume.

In marketing, the 4Ps define all of marketing using only four terms.
Product, Promotion, Place, and Price.

I claim that the 3Vs above totally define big data in a similar fashion.
These three properties define the expansion of a data set along various fronts to the point where it merits being called big data, an expansion that is accelerating to generate yet more data of various types.

The plot above, using three axes, helps to visualize the concept.

Data Volume:
The size of available data has been growing at an increasing rate. This applies to companies and to individuals. A text file is a few kilobytes, a sound file is a few megabytes, while a full-length movie is a few gigabytes.
More sources of data are added on a continuous basis. For companies, in the old days, all data was generated internally by employees. Currently, the data is generated by employees, partners and customers. For a group of companies, the data is also generated by machines. For example, hundreds of millions of smartphones send a variety of information to the network infrastructure. This data did not exist five years ago.
More sources of data with a larger size of data combine to increase the volume of data that has to be analyzed. This is a major issue for those looking to put that data to use instead of letting it just disappear.
Petabyte data sets are common these days, and exabyte is not far away.

Large Synoptic Survey Telescope (LSST).
"Over 30 thousand gigabytes (30TB) of images will be generated every night during the decade-long LSST sky survey."
72 hours of video are uploaded to YouTube every minute
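Taking the LSST figure at face value (and ignoring weather and downtime), a quick back-of-the-envelope check shows why petabyte, and soon exabyte, scales follow from a single instrument:

```python
tb_per_night = 30
nights_per_year = 365
years = 10

total_tb = tb_per_night * nights_per_year * years
print(total_tb)               # 109500 TB over the decade
print(total_tb / 1_000_000)   # ≈ 0.11 exabytes from one telescope alone
```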

There is a corollary to Parkinson's law that states: "Data expands to fill the space available for storage." (https://en.wikipedia.org/wiki/Parkinson%27s_law)

This is no longer true since the data being generated will soon exceed all available storage space.

Data Velocity:
Initially, companies analyzed data using a batch process. One takes a chunk of data, submits a job to the server and waits for delivery of the result. That scheme works when the incoming data rate is slower than the batch processing rate and when the result is useful despite the delay. With the new sources of data such as social and mobile applications, the batch process breaks down. The data is now streaming into the server in real time, in a continuous fashion and the result is only useful if the delay is very short.
140 million tweets per day on average (more in 2012).

I have not yet determined how data velocity may continue to increase since real time is as fast as it gets. The delay for the results and analysis will continue to shrink to also reach real time.
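The contrast between the two schemes can be made concrete with a running mean: the batch version answers once, after all the data has arrived, while the streaming version has an up-to-date answer after every value.

```python
def batch_mean(values):
    """Batch processing: wait for the whole chunk, then compute once."""
    return sum(values) / len(values)

def streaming_means(stream):
    """Stream processing: update the answer as each value arrives, so a
    result is available in (near) real time with no wait for a full batch."""
    total = count = 0
    for v in stream:
        total += v
        count += 1
        yield total / count

data = [4, 8, 6, 2]
running = list(streaming_means(iter(data)))
print(running)           # [4.0, 6.0, 6.0, 5.0] — an answer after every value
print(batch_mean(data))  # 5.0 — the same final answer, delivered only at the end
```

Both converge to the same number; the difference the text describes is entirely about when the answer becomes available relative to the incoming data rate.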

Data Variety:
From Excel tables and databases, data has changed to lose its structure and to add hundreds of formats: pure text, photo, audio, video, web, GPS data, sensor data, relational databases, documents, SMS, PDF, Flash, etc. One no longer has control over the input data format. Structure can no longer be imposed as in the past in order to keep control over the analysis. As new applications are introduced, new data formats come to life.

Google uses smart phones as sensors to determine traffic conditions.
In this application they are most likely reading the speed and position of millions of cars to construct the traffic pattern in order to select the best routes for those asking for driving directions. This sort of data did not exist on a collective scale a few years ago.

The 3Vs together describe a set of data and a set of analysis conditions that clearly define the concept of big data.


So what is one to do about this?

So far, I have seen two approaches:
1. Divide and conquer using Hadoop.
2. Brute force using an "appliance" such as the SAP HANA (High-Performance Analytic Appliance).

In the divide and conquer approach, the huge data set is broken down into smaller parts (HDFS) and processed (MapReduce) in a parallel fashion using thousands of servers.

As the volume of the data increases, more servers are added and the process runs in the same manner. Need a shorter delay for the result? Add more servers again. Given that with the cloud, server power is practically infinite, it is really just a matter of cost: how much is it worth to get the result in a shorter time?
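The divide and conquer idea can be shown in miniature with the classic word count: split the input into shards, map each shard independently, shuffle by key, and reduce each group. In real Hadoop the shards live in HDFS and the phases run on thousands of machines; here everything runs in one process purely to show the shape of the computation.

```python
from collections import defaultdict
from itertools import chain

def map_shard(shard):
    """Map: each 'server' turns its shard of text into (word, 1) pairs."""
    return [(word, 1) for word in shard.split()]

def shuffle(pairs):
    """Shuffle: group intermediate pairs by key, as MapReduce does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_counts(groups):
    """Reduce: combine each group independently — hence trivially parallel."""
    return {key: sum(values) for key, values in groups.items()}

shards = ["big data big", "data is big"]  # the split step (HDFS blocks)
mapped = chain.from_iterable(map_shard(s) for s in shards)
counts = reduce_counts(shuffle(mapped))
print(counts)  # {'big': 3, 'data': 2, 'is': 1}
```

Because each map and each reduce touches only its own slice of the data, adding servers speeds things up almost linearly, which is exactly the scaling argument made above.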

One has to accept that not ALL data analysis can be done with Hadoop. Other tools are always required.

For the brute force approach, a very powerful server with terabytes of memory is used to crunch the data as one unit. The data set is compressed in memory. For example, for a Twitter data flow that is pure text, the compression ratio may reach 100:1. A 1TB IBM SAP HANA can then load a data set of 100TB in memory and do analytics on it.

IBM has a 100TB unit for demonstration purposes.

Many other companies are filling in the gap between these two approaches by releasing all sorts of applications that address different steps of the data processing sequence plus the management and the system configuration.

Read more…

The Data Science Industry: Who Does What

Guest blog post by Laetitia Van Cauwenberge

Interesting infographic, produced by an organisation offering R and data science training. Click here to see the original version. I would add that one of the core competencies of the data scientist is to automate the process of data analysis, as well as to create applications that run automatically in the background, sometimes in real time, e.g.

  • to find and bid on millions of Google keywords each day (eBay and Amazon do that, and most of these keywords have little or no historical performance data, so keyword aggregation algorithms - putting keywords in buckets - must be used to find the right bid based on expected conversion rates),
  • buy or sell stocks,
  • monitor networks and generate automated alerts sent to the right people (to warn about a potential fraud, etc.)
  • or to recommend products to a user, identify optimum pricing, manage inventory, or identify fake reviews (a problem that Amazon and Yelp have failed to solve to this day)  
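The keyword-bucketing trick in the first bullet can be sketched as follows. The bucketing rule (first two words) and every number below are invented for illustration; the point is that a keyword with no history of its own borrows the conversion rate of its bucket.

```python
from collections import defaultdict

def bucket_key(keyword):
    """Assumed bucketing rule: group keywords by their first two words."""
    return " ".join(keyword.split()[:2])

history = [  # (keyword, clicks, conversions) — made-up numbers
    ("red running shoes", 1000, 50),
    ("red running socks", 10, 1),
    ("red running hat", 0, 0),   # no performance history of its own
]

# Aggregate clicks and conversions per bucket.
buckets = defaultdict(lambda: [0, 0])
for keyword, clicks, conversions in history:
    key = bucket_key(keyword)
    buckets[key][0] += clicks
    buckets[key][1] += conversions

def estimated_conversion_rate(keyword):
    """A history-less keyword inherits its bucket's pooled conversion rate."""
    clicks, conversions = buckets[bucket_key(keyword)]
    return conversions / clicks if clicks else 0.0

print(estimated_conversion_rate("red running hat"))  # ≈ 0.0505, from the bucket
```

The estimated rate then feeds the bid calculation: a keyword expected to convert at 5% can justify a far higher bid than one expected to convert at 0.5%.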


Read more…

Google is a prolific contributor to open source. Here is a list of four open-source and cloud projects from Google focusing on analytics, machine learning, data cleansing and visualization.


TensorFlow

TensorFlow™ is an open source software library for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them. The flexible architecture allows you to deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API. TensorFlow was originally developed by researchers and engineers working on the Google Brain Team within Google's Machine Intelligence research organization for the purposes of conducting machine learning and deep neural networks research, but the system is general enough to be applicable in a wide variety of other domains as well.
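The "nodes are operations, edges are data" model is easy to miniaturize. The toy evaluator below is not TensorFlow, just a pure-Python illustration of the dataflow idea, with scalars standing in for multidimensional tensors: evaluating the output node pulls values through the whole graph.

```python
class Node:
    """A node applies an operation to the values flowing in along its edges."""
    def __init__(self, op, *inputs):
        self.op, self.inputs = op, inputs

    def eval(self):
        # Recursively evaluate upstream nodes, then apply this node's operation.
        return self.op(*(node.eval() for node in self.inputs))

def const(value):
    """A source node with no inputs that simply emits a value."""
    return Node(lambda: value)

# Dataflow graph for (2 + 3) * 4: operations are nodes, values travel on edges.
a, b, c = const(2.0), const(3.0), const(4.0)
add = Node(lambda x, y: x + y, a, b)
mul = Node(lambda x, y: x * y, add, c)
print(mul.eval())  # 20.0
```

Representing the computation as an explicit graph, rather than running it eagerly, is what lets a system like TensorFlow schedule the same program across CPUs, GPUs or mobile devices.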


OpenRefine

OpenRefine (formerly Google Refine) is a powerful tool for working with messy data: cleaning it; transforming it from one format into another; and extending it with web services and external data. Please note that since October 2nd, 2012, Google is not actively supporting this project, which has now been rebranded to OpenRefine. Project development, documentation and promotion are now fully supported by volunteers.

Google Charts

Google Charts provides a perfect way to visualize data on your website. From simple line charts to complex hierarchical tree maps, the chart gallery provides a large number of ready-to-use chart types.

The most common way to use Google Charts is with simple JavaScript that you embed in your web page. You load some Google Chart libraries, list the data to be charted, select options to customize your chart, and finally create a chart object with an id that you choose. Then, later in the web page, you create a <div> with that id to display the Google Chart.

Automatic Statistician

Making sense of data is one of the great challenges of the information age we live in. While it is becoming easier to collect and store all kinds of data, from personal medical data, to scientific data, to public data, and commercial data, there are relatively few people trained in the statistical and machine learning methods required to test hypotheses, make predictions, and otherwise create interpretable knowledge from this data. The Automatic Statistician project aims to build an artificial intelligence for data science, helping people make sense of their data.

This article is compiled by Jogmon.

Originally posted on Data Science Central

Read more…

Guest blog and great infographic from our friends at

Nowadays, the data science field is hot, and it is unlikely that this will change in the near future. While a data-driven approach is finding its way into all facets of business, companies are fiercely fighting for the best data analytics skills available in the market, and salaries for data science roles are going into overdrive.

Companies’ increased focus on acquiring data science talent goes hand in hand with the creation of a whole new set of data science roles and titles. Sometimes new roles and titles are added to reflect changing needs; other times they are probably created as a creative way to differentiate from fellow recruiters. Either way, it’s hard to get a general understanding of the different job roles, and it gets even harder when you’re trying to figure out which job role could best fit your personal and professional ambitions.

The Data Science Industry: Who Does What

DataCamp took a look at this avalanche of data science job postings in an attempt to unravel these cool-sounding and playful job titles into a comparison of different data science related careers. We summarized the results in our latest infographic “The Data Science Industry: Who Does What”:

In this infographic we compare the roles of data scientists, data analysts, data architects, data engineers, statisticians and many more. We look at their roles within companies and the data science process, what technologies they have mastered, and what the typical skillset and mindset are for each role. Furthermore, we look at the top employers that are currently hiring for these different data science roles and how the average national salaries of these roles map out against each other.

Hopefully this infographic will help you to better understand the different job roles that are available to data-passionate professionals.

The original blog and infographic can be seen here.

Originally posted on Data Science Central

Read more…

Guest blog shared by Stefan Kingham at

This infographic displays data from The Economist Intelligence Unit’s ‘Healthcare Outcomes Index 2014’, which took into account a number of diverse and complex factors to produce a ranking of the world’s best-performing countries in healthcare (outcome).

In order to produce a rounded set of outcome rankings, the EIU used basic factors like life expectancy and infant mortality rates alongside weighted factors such as Disability-Adjusted Life Years (DALYs) and Health-Adjusted Life Expectancy (HALEs), whilst also taking ageing populations and adult mortality rates into consideration.

The EIU also produced an overview of how much countries spend each year on healthcare per capita. This spending ranking is based on data from the World Health Organization (WHO).

By plotting the EIU’s outcome rankings against spending rankings for each country, we are able to develop a global overview of how effectively countries use their healthcare budgets.

See the original post here.

Originally posted on Data Science Central

Read more…

DataViz for Cavemen

The late seventies are considered prehistoric times by most data scientists. Yet it was the beginning of a new era, with people getting their first personal computers, or at least programmable calculators like the one pictured below. The operating system was called DOS, and later became MS-DOS, for Microsoft Disk Operating System. You could use your TV set as a monitor, and tapes and a tape recorder (then later floppy disks) to record data. Memory was limited to 64KB. Regarding the HP-41 model below, the advertising claimed that with some extra modules, you could write up to 2,000 lines of code and save them permanently. I started my career with the very model featured below, offered as a birthday present back in high school (I even inverted matrices with it). Math teachers were afraid of these machines; I believe they were banned from schools at some point.

One of the interesting features of these early times was that there was no real graphics device, not for personal use anyway (publishers did have access to expensive plotting machines back then). So the trick was to produce graphs and images using only ASCII characters. Typical monitors could display 25 lines of 40 characters each, in a fixed-width font (Courier). More advanced systems would let you switch between two virtual screens, extending the length of a line to 80 characters.

Here are some of the marvels you could produce back then - now this is considered an art. Has anyone ever made a video using just ASCII characters like in this picture? If anything, it shows how shallow big data can be: a 1024 x 1024 image (or a video made up of hundreds of such frames) can be compressed by a factor of 2,000 or more, and yet it still conveys pretty much all the useful information available in the big, original version. This raises another question: could this technique be used for face recognition?
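The core trick is simple enough to sketch: map each pixel's brightness onto a character of roughly matching "ink density". The palette and the sample gradient below are made up for illustration.

```javascript
// Toy sketch of ASCII rendering: brightness (0-255) -> character.
const PALETTE = ' .:-=+*#%@';  // light to dark

function toAscii(pixels) {
  // pixels: 2D array of brightness values in [0, 255]
  return pixels.map(row =>
    row.map(v => {
      const idx = Math.min(
        PALETTE.length - 1,
        Math.floor((v / 256) * PALETTE.length));
      return PALETTE[idx];
    }).join('')
  ).join('\n');
}

// Example: a small horizontal gradient rendered as text.
const gradient = Array.from({ length: 4 }, () =>
  Array.from({ length: 16 }, (_, x) => Math.floor((x / 16) * 256)));
console.log(toAscii(gradient));
```

A real converter would first downsample the image to the terminal's character grid (e.g. 40 x 25), which is exactly where the factor-of-thousands compression mentioned above comes from.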

This is supposed to be Obama - see details

Click here for details or to download this text file (the image)!

Originally posted on Data Science Central

DSC Resources

Additional Reading

Follow us on Twitter: @DataScienceCtrl | @AnalyticBridge

Read more…

Guest blog by Fabio Souto

A curated list of awesome data visualization frameworks, libraries and software. Inspired by awesome-python.

Table of contents

JavaScript tools

Charting libraries

  • C3 - a D3-based reusable chart library.
  • Chart.js - Charts with the canvas tag.
  • Charted - A charting tool that produces automatic, shareable charts from any data file.
  • Chartist.js - Responsive charts with great browser compatibility.
  • Dimple - An object-oriented API for business analytics.
  • Dygraphs - Interactive line charts library that works with huge datasets.
  • Echarts - Highly customizable and interactive charts ready for big datasets.
  • Epoch - Perfect to create real-time charts.
  • Highcharts - A charting library based on SVG and VML rendering. Free (CC BY-NC) for non-profit projects.
  • MetricsGraphics.js - Optimized for time-series data.
  • Morris.js - Pretty time-series line graphs.
  • NVD3 - A reusable charting library written in d3.js.
  • Peity - A library to create small inline svg charts.
  • TechanJS - Stock and financial charts.

Charting libraries for graphs

  • Cola.js - A tool to create diagrams using constraint-based optimization techniques. Works with d3 and svg.js.
  • Cytoscape.js - JavaScript library for graph drawing maintained by Cytoscape core developers.
  • Linkurious - A toolkit to speed up the development of graph visualization and interaction applications. Based on Sigma.js.
  • Sigma.js - JavaScript library dedicated to graph drawing.
  • VivaGraph - Graph drawing library for JavaScript.

Maps

  • CartoDB - An open source tool that allows for the storage and visualization of geospatial data on the web.
  • Cesium - WebGL virtual globe and map engine.
  • Leaflet - JavaScript library for mobile-friendly interactive maps.
  • Leaflet Data Visualization Framework - A framework designed to simplify data visualization and thematic mapping using Leaflet.
  • Mapsense.js - Combines d3.js with tile maps.
  • Modest Maps - BSD-licensed display and interaction library for tile-based maps in JavaScript.



  • dc.js - Multi-dimensional charting built to work natively with crossfilter.

Misc

  • Chroma.js - A small library for color manipulation.
  • Piecon - Pie charts in your favicon.
  • Recline.js - Simple but powerful library for building data applications in pure JavaScript and HTML.
  • Textures.js - A library to create SVG patterns.
  • Timeline.js - Create interactive timelines.
  • Vega - Vega is a visualization grammar, a declarative format for creating, saving, and sharing interactive visualization designs.
  • Vis.js - A dynamic visualization library including timeline, networks and graphs (2D and 3D).

Android tools

  • HelloCharts - Charting library for Android compatible with API 8+.
  • MPAndroidChart - A powerful & easy to use chart library.

C++ tools

Golang tools

  • Charts for Go - Basic charts in Go. Can render to ASCII, SVG and images.
  • svgo - Go Language Library for SVG generation.

iOS tools

  • JBChartView - Charting library for both line and bar graphs.
  • PNChart - A simple and beautiful chart lib used in Piner and CoinsMan.
  • ios-charts - iOS port of MPAndroidChart. You can create charts for both platforms with very similar code.

Python tools

  • bokeh - Interactive Web Plotting for Python.
  • matplotlib - A python 2D plotting library.
  • pygal - A dynamic SVG charting library.
  • seaborn - A library for making attractive and informative statistical graphics.
  • toyplot - The kid-sized plotting toolkit for Python with grownup-sized goals.

R tools

  • ggplot2 - A plotting system based on the grammar of graphics.
  • rbokeh - R Interface to Bokeh.
  • rgl - 3D Visualization Using OpenGL

Ruby tools

  • Chartkick - Create beautiful JavaScript charts with one line of Ruby.

Other tools

Tools that are not tied to a particular platform or language.

  • Lightning - A data-visualization server providing API-based access to reproducible, web-based, interactive visualizations.
  • RAW - Create web visualizations from CSV or Excel files.
  • Spark - Sparklines for the shell. It has several implementations in different languages.
  • Periscope - Create charts directly from SQL queries.
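As a tiny illustration of the sparkline idea behind tools like Spark, a series can be scaled onto the eight Unicode block characters; the sample values below are arbitrary.

```javascript
// Toy sparkline: scale a numeric series onto eight block characters.
const BARS = '▁▂▃▄▅▆▇█';

function sparkline(values) {
  const min = Math.min(...values);
  const max = Math.max(...values);
  const span = max - min || 1;  // avoid division by zero on flat series
  return values
    .map(v => BARS[Math.min(7, Math.floor(((v - min) / span) * 8))])
    .join('');
}

console.log(sparkline([1, 5, 22, 13, 53]));
```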



Twitter accounts



Contributing
  • Please check for duplicates first.
  • Keep descriptions short, simple and unbiased.
  • Please make an individual commit for each suggestion.
  • Add a new category if needed.

Thanks for your suggestions!



To the extent possible under law, Fabio Souto has waived all copyright and related or neighboring rights to this work.

Originally posted on Data Science Central

Read more…

Guest blog post by Laetitia Van Cauwenberge

Your best references to do your job or get started in data science.

  1. Machine Learning on GitHub
  2. Supervised Learning on GitHub
  3. Cheat Sheet: Data Visualization with R
  4. Cheat Sheet: Data Visualisation in Python
  5. scikit-learn Algorithm Cheat Sheet
  6. Vincent Granville's Data Science Cheat Sheet - Basic
  7. Vincent Granville's Data Science Cheat Sheet - Advanced
  8. Cheat Sheet – 10 Machine Learning Algorithms & R Commands
  9. Microsoft Azure Machine Learning : Algorithm Cheat Sheet
  10. Cheat Sheet – Algorithm for Supervised and Unsupervised Learning
  11. Machine Learning and Predictive Analytics, on Dzone
  12. ML Algorithm Cheat Sheet by Laurence Diane
  13. CheatSheet: Data Exploration using Pandas in Python
  14. 24 Data Science, R, Python, Excel, and Machine Learning Cheat Sheets.

Click here for picture source


Read more…

If you are a data scientist or analyst looking to implement Tableau or join a Tableau implementation company, here are 8 companies you can evaluate.

Tableau helps anyone quickly analyze, visualize and share information. Rapid-fire business intelligence gives you the ability to answer your own questions in minutes.

You can work with all kinds of data, from Hadoop to data warehouses to spreadsheets, and across disparate data sets.

  • Your entire organization is served, from executives to analysts, across departments and geographic locations, in the office or on-the-go.

  • It fits in seamlessly as an extension of your IT infrastructure.

  • You can take advantage of the new generation of user-friendly, visual interfaces to spot outliers and trends in complex data.

  • You, and everyone around you, are self-reliant. When it comes to getting answers from data, you don’t have to wait for anyone or anything.

  • And it’s easy on your budget, providing low cost of ownership and a return on your investment in days or weeks, not months or years.

Tableau puts analytics in the hands of the user. By enabling individual creativity and exploration from the ground floor, businesses now have the ability to adapt and outperform the competition through intuitive data visualization and analysis.

Tableau can connect to virtually any data source, be it a corporate data warehouse, Microsoft Excel or web-based data. It gives users immediate insights by transforming their data into beautiful, interactive visualizations in a matter of seconds. What took expensive teams days or months to develop is now achieved through the use of a user-friendly drag-and-drop interface.

Boulder Insights

  • Uncover unseen patterns in your data

  • Navigate sophisticated drag and drop dashboards with ease

  • Analyze millions of rows of data ad hoc and in moments

  • Inform smart business decisions

  • Combine multiple data sources for quick analysis

  • View real-time refreshed dashboards

  • Share beautiful and clear dashboards with colleagues

Syntelli Solutions

Tableau’s rapid-fire business intelligence software lets everyone in your organization analyze and understand their data far faster than any other solution, and at a fraction of the cost and resources.

  • Ease of Use

Tableau Desktop lets you create rich visualizations and dashboards with an intuitive, drag-and-drop interface that lets you see every change as you make it. Anyone comfortable with Excel can get up to speed on Tableau quickly.

  • Direct Connect and Go

There is minimal set-up required with Tableau. In minutes you’ll be consolidating numbers and visualizing results without any advance set-up, freeing up your IT resources and enabling you to arrive at results quickly.

  • Perfect Mashups

Connect to data in one click and layer in a second data source with another. Combining data sources in the same view is so easy it feels like cheating.

  • Best Practices in a Box

Tableau has best practices built right in. You get the benefit of years of research on the best way to represent data, from color schemes that convey meaning to an elegant design that keeps users focused on what’s important.

marquis leadership

Tableau lets you use more than one data series to spotlight high and low performers in a single view. The bar chart below shows Sales by Customers, sorted in order of highest sales. I dropped the Profit data onto the Color mark, and, voila! I see not just customers who have the most sales, but also those with highest profits – just by looking at the gradation of colors from red to green. Suddenly it’s easy to see that some of the customers with lower sales are more profitable than customers with higher sales.

TEG Analytics

Tableau is the leading data visualization tool in the Gartner Magic Quadrant. And nobody knows Tableau better than us…

  • Business performance evaluations and performance-driver analysis dashboards for a Fortune 500 US CPG company

  • Pricing elasticity analytics dashboard & simulator for one of the world’s leading shipping and logistics companies

  • Inventory management and built-in alert system dashboards for the world’s leading retail company

  • Bug management and planning dashboards for a large IT product development company

  • Campaign management and promotions analysis reporting for a large digital and direct mail agency


Tableau’s most obvious feature is its easy, intuitive user interface. Open it and go, anytime and anywhere. Use it on your desktop, on the Web, or on your mobile device. Tableau showed the world what self-service BI and data discovery should be, and it keeps leading the way.

That leadership has been acknowledged year after year by the industry’s touchstone, the Gartner Magic Quadrant, which has rated Tableau the most powerful, most intuitive analysis tool on the market for the last four years in a row.

But you don’t see the best benefit until you use it: Users stay in the flow from question to answer to more questions and more answers, all the way to insight. They don’t stop to fiddle with data, they just stay with the analysis.


Tableau Software helps people see and understand data. Used by more than 19,000 companies and organizations worldwide, Tableau’s award-winning software delivers fast analytics and rapid-fire business intelligence. Create visualizations and dashboards in minutes, then share in seconds. The result? You get answers from data quickly, with no programming required. Data is everywhere. But most of us struggle to make sense of it. Tableau Software lets anyone visualize data and then share it on the web, no programming needed. It’s wicked-fast, easy analytics.

Originally posted on Data Science Central

Read more…

An Introduction to Data Visualization

Guest blog post by Divya Parmar

After data science, which I discussed in an earlier post, data visualization is one of the most common buzzwords thrown around in the tech and business communities. To demonstrate how one can actually visualize data, I want to use one of the hottest tools in the market right now: Tableau. You can download Tableau Public for free here, and the “Cat vs. Dog” dataset can be found here. Let’s get started.

1. Play around with the data and find what looks interesting.

After opening Tableau Public and importing my Excel file, I looked over my dataset. I was curious to see if there was a relationship between the rate of cat ownership and the rate of dog ownership. So I put dog ownership on the x-axis and cat ownership on the y-axis; I then added state names as labels. All of this is done by simply dragging and dropping, and below is a snapshot of how intuitive it is.

2. Add some elements as necessary to show your insight.

There are many ways to build on this preliminary step. You can add something like a trend line to demonstrate a statistical relationship; this is done through the "Analysis" tab, and the p-value reported with the trend line adds credibility. You can even give different colors or sizes to different data points, as I have done below, using the number of pet households in each state to emphasize the larger states.
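Under the hood, a linear trend line like this is an ordinary least-squares fit. Here is a rough sketch of that computation; the x/y values are made-up ownership rates for illustration, not the actual dataset.

```javascript
// Ordinary least-squares fit: slope and intercept of the best-fit line.
function linearFit(xs, ys) {
  const n = xs.length;
  const mx = xs.reduce((a, b) => a + b, 0) / n;  // mean of x
  const my = ys.reduce((a, b) => a + b, 0) / n;  // mean of y
  let sxy = 0, sxx = 0;
  for (let i = 0; i < n; i++) {
    sxy += (xs[i] - mx) * (ys[i] - my);
    sxx += (xs[i] - mx) ** 2;
  }
  const slope = sxy / sxx;
  return { slope, intercept: my - slope * mx };
}

// Hypothetical dog-ownership (x) vs cat-ownership (y) rates:
const fit = linearFit([30, 40, 50], [25, 30, 35]);
console.log(fit.slope, fit.intercept);
```

Tableau additionally reports a p-value for the fit, which quantifies how likely a slope this strong would be under no real relationship.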

3. Fix and improve to make usable for export, presentation, or other purpose.

Data visualization is only useful if it is simple and to the point. In the above example, the District of Columbia data point is an outlier that makes the rest of the graph harder to read. You can edit your axis to exclude D.C., and can also hide the confidence bands for the trend line to cut unessential information.

After your visualization is ready, put it to use by sharing, embedding, or whatever means works for you.  Data visualization is easier than you think, and I encourage you to get started.

About: Divya Parmar is a recent college graduate working in IT consulting. For more posts every week, and to subscribe to his blog, please click here.

Read more…

In order to use Big Data effectively, a holistic approach is required. Organizations are now using data analytics at every level, and roles that previously would have had no need to concern themselves with data are now required to have some degree of understanding in order to leverage insights.

Ensuring that data is presented in such a way as to be understood and utilized by all employees is, however, a challenge. Most Big Data actually yields neither meaning nor value, and the sheer volume coming into businesses can be overwhelming. Companies are therefore increasingly moving away from simple 2D Excel charts, and replacing or supplementing them with powerful data visualization tools.

Sophisticated data visualization is a tool that supports analytic reasoning. It accommodates the large numbers of data points provided by Big Data using additional dimensions, colors, animations, and the digital equivalents of familiar items such as dials and gauges. The user can look at the graphics provided to reveal entanglements that may otherwise be missed, and patterns in the data can be displayed to the user at great speeds. Nicholas Marko, Chief Data Officer at Geisinger Health System, notes that: ‘You need to represent a lot of information data points in a small space without losing the granularity of what you're trying to reflect. That's why you're seeing different kinds of graphics, infographics, and more art-like or cartoon-like visualizations. It's certainly better than increasing the density of dots on a graph. At a certain point, you're not getting anything more meaningful.’

There are numerous benefits to data visualization. It reduces dependence on IT, allowing them to focus on adding value and optimizing processes. Increasing the speed at which data is analyzed also means that it can be acted upon more quickly. Intel’s IT Manager Survey found that IT managers expect 63% of all analytics to be done in real time, and those who cannot understand and act on data in this way face losing their competitive advantage to others.

Data visualization for mobile, in particular, is becoming increasingly important. The requirement for real-time analytics and the ubiquity of mobile devices mean that many data visualization vendors are now either adapting desktop experiences to mobile formats or taking a mobile-first approach to developing their technology. There are obvious constraints, with smartphone display space limited. Information needs to be presented more simply than on a desktop, so rather than just translating a complex desktop view into a simpler mobile one, it is important to consider context. Designers are also exploiting gesture-based input to help users easily navigate different views and interact with the data.

Gartner estimated a 30% compound annual growth rate in the use of data analytics through 2015. Visualization data discovery tools offer a tremendous opportunity to manage and leverage the growing volume, variety, and velocity of new and existing data. Using the faster, deeper insights afforded, companies are more agile, and have a significant competitive advantage moving forward. 

Originally posted on Data Science Central

Read more…

Title: How Flextronics uses DataViz and Analytics to Improve Customer Satisfaction
Date: Tuesday, December 15, 2015
Time: 09:00 AM Pacific Standard Time
Duration: 1 hour

Space is limited.
Reserve your Webinar seat now

Flexibility in adapting to changing global markets and customer needs is necessary to stay competitive, and the Flextronics analytics team is tasked with making sure the Flex management team has accurate and up-to-date analytics to optimize their business’s performance, efficiency, and customer service.

In our latest DSC webinar series, Joel Woods from Flextronics’ Global Services and Solutions will share success stories around the repair and refurbishment of customer products, utilizing analytics and data visualization from Tableau and Alteryx.

You will learn how to:

  • Use data analytics to improve cost savings
  • Resolve common data challenges such as blending disparate data sources
  • Deliver automated and on-demand reporting to clients
  • Provide visualizations that show the analytics that matter to both internal teams and customers

About Flextronics:

Flextronics is an industry-leading end-to-end supply chain solutions company with $26 billion in sales, generated from helping customers design, build, ship, and service their products through an unparalleled network of facilities in approximately 30 countries across four continents.


Ross Perez, Sr Product Manager -- Tableau
Joel Woods, Advanced Analytics Lead -- Flex Inc
Maimoona Block, Alliance Manager -- Alteryx 

Hosted by Bill Vorhies, Editorial Director -- Data Science Central

Again, space is limited, so please register early:
Reserve your Webinar seat now

After registering you will receive a confirmation email containing information about joining the Webinar.
Read more…
