
New Books and Resources for DSC Members


We are in the process of writing and adding new material (compact eBooks), exclusively available to our members and written in simple English by world-leading experts in AI, data science, and machine learning. We invite you to sign up here so you don't miss these free books.



Currently, the following content is available:

1. Statistics: New Foundations, Toolbox, and Machine Learning Recipes

By Vincent Granville. This book is intended for busy professionals working with data of any kind: engineers, BI analysts, statisticians, operations research practitioners, AI and machine learning professionals, economists, data scientists, biologists, and quants, ranging from beginners to executives. In about 300 pages and 28 chapters it covers many new topics, offering a fresh perspective on the subject, including rules of thumb and recipes that are easy to automate or integrate into black-box systems, as well as new model-free, data-driven foundations for statistical science and predictive analytics. The approach focuses on robust techniques; it is bottom-up (from applications to theory), in contrast to the traditional top-down approach. The material is accessible to practitioners with a one-year college-level exposure to statistics and probability. The compact and tutorial style, featuring many applications with numerous illustrations, is aimed at practitioners, researchers, and executives in various quantitative fields.

The table of contents is available here. The book can be accessed here (members only.)

2. Deep Learning and Computer Vision with CNNs

By Dan Howarth and Ajit Jaokar, October 2019. 42 pages. Part 1 will introduce the core concepts of Deep Learning. We will also start coding straightaway with TensorFlow 2.0. In part 2, we use another dataset - the MNIST dataset - to build on our knowledge. In particular, we will:

  • Introduce Computer Vision
  • Introduce convolutional layers into our models
  • Introduce the concept of regularisation
  • Introduce the validation set in training our model
  • Introduce how to save and reuse our model

The table of contents is available here. The book can be accessed here (members only.)

3. Getting Started with TensorFlow 2.0

By Amita Kapoor and Ajit Jaokar. In this book, we introduce coding with TensorFlow 2.0. We show how to develop with TensorFlow 1.0 and contrast how the same code can be developed in TensorFlow 2.0. The book emphasizes the unique features of TensorFlow 2.0. Earlier this year, Google announced TensorFlow 2.0, a major leap from the existing TensorFlow 1.0. The key differences are:

  • Ease of use
  • Eager Execution
  • Model Building and deploying made easy
  • The Data pipeline simplified

The table of contents is available here. The book can be accessed here (members only.)

4. Book: Classification and Regression In a Weekend

This tutorial began as a series of weekend workshops created by Ajit Jaokar and Dan Howarth. The idea was to work with a specific (longish) program such that we explore as much of it as possible in one weekend. This book is an attempt to take this idea online. The best way to use this book is to work with the Python code as much as you can. The code has comments, but you can extend the comments with the concepts explained here.

The table of contents is available here. The book can be accessed here (members only.)

5. Online Encyclopedia of Statistical Science

This online book is intended for beginners, college students and professionals confronted with statistical analyses. It is also a refresher for professional statisticians.  The book covers over 600 concepts, chosen out of more than 1,500 for their popularity. Entries are listed in alphabetical order, and broken down into 18 parts. In addition to numerous illustrations, we have added 100 topics not covered in our online series Statistical Concepts Explained in Simple English. We also included a number of visualizations from our series Statistical Concepts Explained in One Picture.

The table of contents is available here. The book can be accessed here (members only.)

6. Book: Azure Machine Learning in a Weekend

This book by Ajit Jaokar and Ayse Mutlu is the second book in the ‘in a weekend’ series – after Classification and Regression in a Weekend. The idea of the ‘in a weekend’ series of books is to study one complex section of code in a weekend to master the concept. Cloud computing changes the development paradigm. Specifically, it combines development and deployment (the DevOps approach). In complex environments, the developer has to know more than coding. Rather, she has to be familiar with both data engineering and DevOps. This book helps you get started with coding a complex AI application for the cloud (Azure).

The table of contents is available here. The book can be accessed here (members only.)

7. Book: Enterprise AI - An Application Perspective

Enterprise AI: An Application Perspective takes a use-case-driven approach to understanding the deployment of AI in the enterprise. Designed for strategists and developers, the book provides a practical and straightforward roadmap based on application use cases for AI in enterprises. The authors (Ajit Jaokar and Cheuk Ting Ho) are data scientists and AI researchers who have deployed AI applications for enterprise domains. The book is used as a reference for Ajit and Cheuk's new course on Implementing Enterprise AI.

The table of contents is available here. The book can be accessed here (members only.)

8. Book: Applied Stochastic Processes

Full title: Applied Stochastic Processes, Chaos Modeling, and Probabilistic Properties of Numeration Systems. Published June 2, 2018. Author: Vincent Granville, PhD. (104 pages, 16 chapters.)

This book is intended for professionals in data science, computer science, operations research, statistics, machine learning, big data, and mathematics. In 100 pages, it covers many new topics, offering a fresh perspective on the subject. It is accessible to practitioners with a two-year college-level exposure to statistics and probability. The compact and tutorial style, featuring many applications (blockchain, quantum algorithms, HPC, random number generation, cryptography, Fintech, web crawling, statistical testing) with numerous illustrations, is aimed at practitioners, researchers, and executives in various quantitative fields.

New ideas, advanced topics, and state-of-the-art research are discussed in simple English, without using jargon or arcane theory. It unifies topics that are usually part of different fields (data science, operations research, dynamical systems, computer science, number theory, probability) broadening the knowledge and interest of the reader in ways that are not found in any other book. This short book contains a large amount of condensed material that would typically be covered in 500 pages in traditional publications. Thanks to cross-references and redundancy, the chapters can be read independently, in random order.

The table of contents is available here. The book (PDF) can be accessed here (members only.)

Read more…

Originally posted here, by Jorge Castanon.

Ancient ruins are sometimes discovered after long years investigating regions of the world covered by dense jungle or giant forests. The feeling of an archaeologist at that moment of discovery gives a window into the feeling data scientists often have when getting a view of their data — through visualizations — that clarifies a key aspect of the analysis.



For both, it’s a Eureka moment!

Data visualization plays two key roles:

1. Communicating results clearly to a general audience.

2. Organizing a view of data that suggests a new hypothesis or a next step in a project.

It’s no surprise that most people prefer visuals to large tables of numbers. That’s why clearly labeled plots with meaningful interpretation always make it to the front of academic papers.

This post looks at the 10 visualizations you can bring to bear on your data — whether you want to convince the wider world of your theories or crack open your own project and take the next step:

  1. Histograms
  2. Bar/Pie charts
  3. Scatter/Line plots
  4. Time series
  5. Relationship maps
  6. Heat maps
  7. Geo Maps
  8. 3-D Plots
  9. Higher-Dimensional Plots
  10. Word clouds


Histograms

Let’s start with histograms, which give us an overview of all the possible values of a numerical variable of interest, as well as how often they occur. Simple yet powerful, histograms are sometimes called data distributions.

In visual terms, we draw a frequency table where the variable of interest is binned into ranges on the x-axis and where we show the frequency of the values in each bin on the y-axis.

For example, imagine a company makes its intelligent thermostats more attractive to consumers by offering rebates that vary by zip code. A histogram of the thermostat rebates helps us understand their range of values, as well as how frequent each value is.

Histogram of Thermostat Rebates in USD

Note that about half of the thermostat rebates were between $100 and $120. Only a handful of zip codes have rebates over $140 or under $60.

Data source here.
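The chart above was built in Watson Studio Desktop, but the binning step behind any histogram is easy to sketch in plain Python. The rebate values below are hypothetical stand-ins, not the actual data set:

```python
from collections import Counter

def histogram(values, bin_width):
    """Count how many values fall into each bin of width `bin_width`."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Hypothetical rebate amounts in USD (not the real data).
rebates = [55, 60, 95, 100, 105, 110, 115, 118, 119, 120, 135, 150]
for lo, n in histogram(rebates, 20).items():
    print(f"${lo}-${lo + 19}: {'#' * n}")
```

Each printed row is one bar of the histogram; a plotting library merely draws the same counts.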

Bar and Pie Charts

Bar (and pie) charts are to categorical variables what histograms are to numerical variables. Both bar and pie charts work best for distributions of variables that can take only a fixed number of values, such as low/normal/high, yes/no, or regular/electric/hybrid.

Bar or pie? It is important to know that pie charts can be visually misleading: human brains are not particularly good at processing them (read more about this in this article¹).

Too many categories can overwhelm either a bar or a pie chart. In that case, consider picking the top categories and visualizing only those.

The next example shows both bar and pie charts for medical patients’ blood pressure, by the categories LOW, NORMAL and HIGH.

Bar and Pie Charts for Patient’s Blood Pressure

Data source here.

Scatter and Line Plots

Probably the simplest charts are scatter plots. They show a two-dimensional (x, y) representation of the data on a Cartesian plane, and are especially helpful for inspecting the relationship between two variables, because they let the viewer explore any correlations visually. Line plots are scatter plots with a line that joins all the dots (frequently used when the variable y is continuous).

For example, assume you want to explore how a house’s price relates to its square footage. The next figure shows a scatter plot with house prices on the y-axis and square footage on the x-axis. Note how the plot shows a level of linear correlation between the variables — in general, the more square footage, the higher the price.

I especially like scatter plots because you can extend their dimensionality with color and size. For instance, coloring each dot by the number of bedrooms in each house adds a third dimension to the chart, and sizing each dot by yet another variable would add a fourth.

Data source here.
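The "level of linear correlation" you can see in such a scatter plot can be quantified with the Pearson coefficient. A small pure-Python sketch, using made-up square footages and prices:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical listings: bigger houses tend to cost more.
sqft = [850, 1100, 1400, 1800, 2300, 3000]
price = [120_000, 150_000, 195_000, 240_000, 310_000, 420_000]
print(round(pearson_r(sqft, price), 3))  # close to 1: strong linear correlation
```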

Time Series Plots

Time plots are scatter plots with a time range on the x-axis where each dot forms part of a line — reminding us that time is continuous (though computers aren’t).

Time series plots are great for visually investigating trends, jumps, and dips in data over time, which makes them especially popular for financial and sensor data.

Here, for example, the y-axis represents the daily close price of Tesla stock from 2015 to 2017.

Time Series Plot of Tesla Stock Close Price from 2015–2017

Data source here.
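When a time series is noisy, a common companion to the raw plot is a smoothed line. A minimal trailing moving average in Python, with made-up close prices rather than the actual Tesla data:

```python
def moving_average(series, window):
    """Trailing moving average: each output averages the last `window` points."""
    return [sum(series[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(series))]

# Hypothetical daily close prices; a 3-day average damps daily noise.
closes = [220, 225, 218, 230, 240, 235, 250]
print(moving_average(closes, 3))
```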

Relationship Charts

If your goal is to develop a comprehensive hypothesis, it can be especially helpful to visually represent the relationships in your data. Imagine you’re a resident scientist at a healthcare company, working on a data science project to help medical doctors accelerate their prescription decisions. Suppose there are four drugs (A, C, X and Y) and that doctors prescribe one and only one drug to each patient. Your data set includes historical data of patient prescriptions accompanied by each patient’s gender, blood pressure, and cholesterol level.

How are relationship charts interpreted? Each column in the dataset is represented with a different color. The thickness of each line represents how important (by frequency count) the relationship between the values of two columns is. Let’s look at the example to dig into the interpretation.

A relationship chart of drug prescriptions offers a few insights:

• All patients with high blood pressure were prescribed Drug A.

• All patients with low blood pressure and high cholesterol level were prescribed Drug C.

• None of the patients prescribed Drug X showed high blood pressure.

With those intriguing insights in hand, you can start to formulate a set of hypotheses — and launch new areas of inquiry. For example, a machine learning classifier might work accurately to predict the usage for Drugs A, C and maybe X, but since Drug Y is tied to all the possible feature values, you might need additional features to begin making predictions.

Patient Drug Prescription Relationship Chart

Data source here.
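Under the hood, the thickness of each line in a relationship chart is just a frequency count over pairs of column values. A sketch with hypothetical patient records (the real data set is not reproduced here):

```python
from collections import Counter

# Hypothetical (blood pressure, prescribed drug) pairs.
records = [
    ("HIGH", "Drug A"), ("HIGH", "Drug A"), ("HIGH", "Drug A"),
    ("LOW", "Drug C"), ("LOW", "Drug X"),
    ("NORMAL", "Drug X"), ("NORMAL", "Drug Y"),
]

links = Counter(records)  # each count becomes one line's thickness
for (bp, drug), n in links.most_common():
    print(f"{bp:7s} -- {drug}: {n}")
```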

Heat Maps

Another cool and colorful way to bring an additional dimension to a 2-D plot is via heat maps, which use color within a matrix or map display to show frequency or concentration. Most users find heat maps especially intuitive since the color concentration pulls out trends and regions of special interest.

The following image shows a visualization of the Levenshtein distances between movie titles within the IMDB database. The further each movie title is from the other titles, the darker it appears in the chart. For example, in terms of Levenshtein distance, Superman is far from Batman Forever but close to Superman 2.

Credit for this great visualization goes to Michael Zargham².

Heat Map of Distances Between Movie Titles
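Levenshtein distance itself (the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another) is easy to compute with dynamic programming. A compact sketch:

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning `a` into `b`."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]

print(levenshtein("Superman", "Superman 2"))     # 2: just append " 2"
print(levenshtein("Superman", "Batman Forever")) # much larger
```

Computing this distance for every pair of titles yields the matrix that the heat map colors.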


Geo Maps

Like most people, I love maps and can spend hours in apps that use maps to visualize interesting data: Google Maps, Zillow, Snapchat, and more. If your data includes longitude and latitude information, or another way to organize it geographically (zip codes, area codes, county data, airport data, etc.), maps can bring a rich context to your visualizations.

Consider the thermostat rebate example from the earlier Histogram section. Recall that the rebates vary by region. Since the data includes longitude and latitude information, we can display the rebates on a map. Once I assigned a color spectrum from lowest rebate (blue) to highest rebate (red), I could lay the data onto a map of the States:

Thermostat Rebates in USD

Data source here.

Word Clouds

A surprising amount of data available for study occurs as simple free text. As a first pass on this data, we might want to visualize word frequency in the corpus, but histograms and pie charts really do best with frequencies in data that’s numerical rather than verbal. So we can turn instead to word clouds.

With free text data, we can start by filtering out stop words like “a,” “and,” “but,” and “how,” and by standardizing all text to lower case. I often find that there’s additional work to do to clean and shape the data, depending on your goals, including removing diacritical marks, stemming, and so on. Once the data is ready, it’s a quick step to use a word cloud visualization to get a sense of the most common words in the corpus.
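The counting that a word cloud then visualizes is straightforward once stop words are filtered out. A minimal sketch (the stop-word list here is a tiny illustrative subset, and the review text is invented):

```python
import re
from collections import Counter

STOP_WORDS = {"a", "an", "and", "the", "but", "how", "is", "it", "of", "to"}

def top_words(text, n=5):
    """Lowercase the text, drop stop words, and count what remains."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS).most_common(n)

review = ("The movie is great and the acting is great, "
          "but the plot is thin. Great cast, great film.")
print(top_words(review, 3))
```

A word cloud simply scales each word's font size by these counts.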

Here, I used the Large Movie Reviews Dataset³ to draw a word cloud for the positive reviews and another for the negative reviews.

Word Cloud From Positive Movie Reviews

Word Cloud From Negative Movie Reviews

3-D Plots

It’s becoming increasingly common to visualize 3-D data by adding a third dimension to a scatter plot. These charts typically benefit from interactivity since rotation and resizing can help the user get meaningful views of the data. The next example shows a 2-dimensional Gaussian probability density function, along with a panel of controls for adjusting the view.

2D Gaussian Probability Density Function

Data source here.

Higher Dimensional Plots

With high-dimensional data, we want to visualize the influence of four, five, or more features at once. To do so, we can first project to two or three dimensions, taking advantage of any of the visualization techniques mentioned earlier. For example, imagine adding a third dimension to our thermostat rebate map where each dot was extended into a vertical line indicating the average energy consumption for that location. Doing so would get us to four dimensions: longitude, latitude, rebate amount, and average energy consumption.

For higher-dimensional data, we often need to reduce the dimensionality using either principal component analysis (PCA) or t-distributed Stochastic Neighbor Embedding (t-SNE).

The most popular dimensionality reduction technique is PCA, which reduces the dimension of the data based on finding new vectors that maximize the linear variation of the data. When the linear correlations of the data are strong, PCA can reduce the dimension of the data dramatically, with little loss of information.

By contrast, t-SNE is a non-linear dimensionality reduction method, which decreases the dimension of the data while approximately preserving the distance between data points in the original high-dimensional space.
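The core of PCA (finding the direction of maximum variance) can be sketched without any library via power iteration on the covariance matrix; the 2-D points below are synthetic:

```python
import random

def first_principal_component(data, iters=200):
    """Unit vector along which the centered data varies most
    (power iteration on the covariance matrix)."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in data]
    cov = [[sum(r[i] * r[j] for r in centered) / n for j in range(d)]
           for i in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

# Synthetic points stretched along the diagonal y = x.
pts = [[t + random.gauss(0, 0.1), t + random.gauss(0, 0.1)] for t in range(20)]
print(first_principal_component(pts))  # roughly (0.707, 0.707)
```

Projecting each point onto this vector reduces 2-D to 1-D; real PCA keeps the top few such directions, and libraries such as scikit-learn do this (and t-SNE) at scale.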

Consider this small sample of the MNIST⁴ database of handwritten digits. The database contains thousands of images of digits from 0 to 9, which researchers use to test their clustering and classification algorithms. The size of these images is 28 x 28 = 784 pixels, but with t-SNE, we can reduce those 784 dimensions to just two:

t-SNE on MNIST Database of Handwritten Digits

Data source here.

So there you have the ten most common visualization types with meaningful examples for each one. All the visualizations in this blog were done using Watson Studio Desktop. Besides Watson Studio Desktop, definitely consider tools like R, Matplotlib, Seaborn, ggplot, and Bokeh, to mention just a few.

And best of luck bringing your data to life!

[1] Stephen Few. (August 2007). “Save The Pies For Dessert”.

[2] Michael Zargham and Jorge Castañón. (2017). A Physics-Based Approach for a Data Science Collaboration. Medium post.

[3] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).

[4] Yann LeCun and Corinna Cortes. (2010). MNIST handwritten digit database.


Special thanks to Steve Moore for his great feedback on this post.

Jorge Castañón, Ph.D.

Twitter: @castanan
LinkedIn: @jorgecasta

Please find the original blog post here.

Read more…

This chart communicates the same insights as a contour plot. What is interesting is the choice of hexagonal buckets (rather than squares) to aggregate data. In fact, any tessellation would work, in particular Voronoi tessellations.

3-D Voronoi tessellation 

The reason for using hexagons is that the hexagon is still pretty simple, and when you rotate the chart by 60 degrees (or a multiple of 60 degrees) you still get the same visualization. For squares, rotations of 60 degrees don't work; only multiples of 90 degrees do. Is it possible to find a tessellation such that smaller rotations, say 45 or 30 degrees, leave the chart unchanged? The answer is no: octagonal tessellations (with regular octagons alone) don't exist, so the hexagon is an optimum.

Hexagonal binning plots (source: here)
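The only non-obvious step in hexagonal binning is mapping a point to the hexagon that contains it. A sketch in Python using axial hex coordinates and cube rounding (the hexagon "size" is its circumradius; the test points are illustrative):

```python
import math

def hex_cell(x, y, size=1.0):
    """Axial (q, r) coordinates of the pointy-top hexagon of circumradius
    `size` containing the point (x, y), via cube rounding."""
    q = (math.sqrt(3) / 3 * x - y / 3) / size
    r = (2 / 3 * y) / size
    rq, rr, rs = round(q), round(r), round(-q - r)
    dq, dr, ds = abs(rq - q), abs(rr - r), abs(rs + q + r)
    if dq > ds and dq > dr:
        rq = -rr - rs            # q was rounded worst: recompute it
    elif dr >= ds:
        rr = -rq - rs            # r was rounded worst: recompute it
    # otherwise the unused s coordinate was the one off; (rq, rr) are fine
    return (rq, rr)

print(hex_cell(0.0, 0.0))           # (0, 0)
print(hex_cell(math.sqrt(3), 0.0))  # (1, 0): the neighbouring hex centre
```

Counting `hex_cell(x, y)` over all points (e.g. with `collections.Counter`) gives exactly the bin counts that a hexbin plot colors.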

Implementation in R

The three plots described here (Voronoi diagram, hexagonal binning and contour plots) are available in the ggplot2 package.

  • Hexagonal binning: ggplot with the stat_binhex layer, see here
  • Contour plot: ggplot with the geom_point and geom_density2d or stat_contour layers, see here (also works with contour)
  • Voronoi diagram: ggplot with the geom_segment layer, see here


Voronoi diagrams can be used for nearest neighbor clustering or density estimation, the density estimate attached to a point being proportional to the inverse of the area of the Voronoi polygon containing it.
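Assigning a point to its Voronoi cell is exactly a nearest-neighbor query over the seed points, which is why the diagram doubles as a clustering tool. A tiny sketch (the seed coordinates are made up; computing polygon areas for the density estimate needs a computational-geometry library and is not shown):

```python
def voronoi_cell(point, seeds):
    """Index of the nearest seed, i.e. the Voronoi cell containing the point."""
    px, py = point
    return min(range(len(seeds)),
               key=lambda i: (seeds[i][0] - px) ** 2 + (seeds[i][1] - py) ** 2)

seeds = [(0, 0), (4, 0), (0, 4)]    # hypothetical cell centres
print(voronoi_cell((1, 1), seeds))  # 0: nearest to (0, 0)
print(voronoi_cell((3, 1), seeds))  # 1: nearest to (4, 0)
```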


Example of contour map (source: here)

Originally posted here

Read more…

This post is devoted to selecting the most popular ad to display on a webpage to gather the most clicks. The rate at which webpage visitors click on an ad is called the conversion rate for the ad.

Assume that we have several ads and a place on a webpage to show one of them. We can display them one by one, record all the clicks, analyse the results afterwards, and figure out the most popular. But displaying an ad may be pricey. It would be more effective to estimate the rates in real time and to display the most popular ad as soon as the rates can be compared, especially if an ad leads to a page where a visitor can buy something. There are a couple of methods for such estimation: the Upper Confidence Bound method and the Thompson Sampling method.

The first one is based on the confidence interval concept, which is studied in any statistics course and has a good intuitive explanation. Roughly speaking, a confidence interval is a numeric interval where our value is supposed to lie with some probability, usually 95%. (The real statistical definition is more technical and means not quite this, but in practice the above explanation is close enough.) During our ad displays we can compute the average rate of each ad at each step, with a corresponding confidence interval, and pick for the next display the ad with the highest upper confidence bound. You can see how it happens in the video below.

The method has some drawbacks. It does not take into account that our rates must be between 0 and 1, so the initial confidence intervals are usually much wider. This means we lose some time getting realistic values for our intervals. Worse, if we throw in an additional ad, the process takes a long time to recover.
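A minimal sketch of the UCB selection rule described above. The counts are hypothetical, and the exploration bonus uses the common sqrt(2 ln t / n) form (other variants exist):

```python
import math

def ucb_choice(clicks, displays, t):
    """Pick the ad with the highest upper confidence bound at round t."""
    best, best_bound = 0, -1.0
    for ad in range(len(displays)):
        if displays[ad] == 0:
            return ad  # show every ad at least once
        bound = (clicks[ad] / displays[ad]
                 + math.sqrt(2 * math.log(t) / displays[ad]))
        if bound > best_bound:
            best, best_bound = ad, bound
    return best

# Hypothetical counts after 300 displays split across three ads.
print(ucb_choice([12, 30, 9], [100, 100, 100], 300))  # 1: highest observed rate
```

Note how a brand-new ad (zero displays) is forced in immediately, which is exactly why adding one mid-run perturbs the process.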

Here is another method which is more efficient: the Thompson Sampling method. It constructs a Beta distribution for each ad's rate and, instead of computing averages, draws a random number according to that distribution. Here is a picture of how it goes for one ad, with a blue vertical line marking the mean and a red line for the random value:

As you see, since a random value is more likely to appear where the curve is higher, it gets closer and closer to the mean at each step. You might view the region where the curve rises above the horizontal axis as an analogue of a confidence interval.

Here is how it works for a few ads (I dropped the means to make the picture clearer):

In addition, it accommodates an additional ad in the middle of the process more easily.
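Thompson Sampling is only a few lines in Python, since the standard library can draw from a Beta distribution; the click counts below are made up:

```python
import random

def thompson_choice(clicks, misses):
    """Sample each ad's Beta posterior once; show the ad with the top draw."""
    draws = [random.betavariate(c + 1, m + 1) for c, m in zip(clicks, misses)]
    return max(range(len(draws)), key=lambda i: draws[i])

# Hypothetical click / no-click counts for three ads after 100 displays each.
clicks, misses = [12, 30, 9], [88, 70, 91]
picks = [thompson_choice(clicks, misses) for _ in range(1000)]
print(max(set(picks), key=picks.count))  # almost always 1, the best ad
```

Adding a new ad mid-run just means appending a (0, 0) entry: its wide Beta(1, 1) posterior gets it explored immediately without disturbing the others.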


Do not hesitate to ask questions!

(The post originally appeared here: Mya Bakhoava's blog.)

Read more…

Power BI: Tutorial

Guest blog by Robert Breen.

What is Power BI?

Power BI is a collection of software services, apps, and connectors that work together to turn your unrelated sources of data into coherent, visually immersive, and interactive insights. Whether your data is a simple Excel spreadsheet, or a collection of cloud-based and on-premises hybrid data warehouses, Power BI lets you easily connect to your data sources, visualize (or discover) what’s important, and share that with anyone or everyone you want.

Power BI can be simple and fast – capable of creating quick insights from an Excel spreadsheet or a local database. But Power BI is also robust and enterprise-grade, ready for extensive modeling and real-time analytics, as well as custom development. So, it can be your personal report and visualization tool, and can also serve as the analytics and decision engine behind group projects, divisions, or entire corporations.


What is Power BI Desktop?

Power BI Desktop is a free application you can install on your local computer that lets you connect to, transform, and visualize your data. With Power BI Desktop, you can connect to multiple different sources of data, and combine them (often called modeling) into a data model that lets you build visuals, and collections of visuals you can share as reports, with other people inside your organization. Most users who work on Business Intelligence projects use Power BI Desktop to create reports, and then use the Power BI service to share their reports with others.

The most common uses for Power BI Desktop are the following:

  • Connect to data
  • Transform and clean that data, to create a data model
  • Create visuals, such as charts or graphs, that provide visual representations of the data
  • Create reports that are collections of visuals, on one or more report pages
  • Share reports with others using the Power BI service

People most often responsible for such tasks are often considered data analysts (sometimes just referred to as analysts) or Business Intelligence professionals (often referred to as report creators). However, many people who don't consider themselves an analyst or a report creator use Power BI Desktop to create compelling reports, or to pull data from various sources and build data models, which they can share with their coworkers and organizations.

With Power BI Desktop you can create complex and visually rich reports, using data from multiple sources, all in one report that you can share with others in your organization.

The steps you need to follow to install the desktop application are:

  • Once you have downloaded the file, open it and follow the instructions:



Connect to data

To get started with Power BI Desktop, the first step is to connect to data. There are many different data sources you can connect to from Power BI Desktop. To connect to your data, follow the next steps:

  • Select the Home ribbon and then select “Get Data”:

  • Then select your data source:

  • When you select a data type, you're prompted for information, such as the URL and credentials, necessary for Power BI Desktop to connect to the data source on your behalf.

Once you connect to one or more data sources, you may want to transform the data so it's useful for you.


Transform and clean data, create a model

In Power BI Desktop, you can clean and transform data using the built-in Query Editor. With Query Editor you can make changes to your data, such as changing a data type, removing columns, or combining data from multiple sources. It's a little bit like sculpting - you can start with a large block of clay (or data), then shave pieces off or add others as needed, until the shape of the data is how you want it.

If for example, you want to change the format of one column you need to follow these steps:

  • Select the column header:

  • Right-click to show the menu and select the “Change Type” option and then choose the right option for you:

  • You’ll see the results:

Each step you take in transforming data (such as renaming a table, transforming a data type, or deleting columns) is recorded by Query Editor, and each time this query connects to the data source those steps are carried out so that the data is always shaped the way you specified.

The following image shows the Query Settings pane for a query that has been shaped and turned into a model.

Once your data is how you want it, you can create visuals.


Create visuals

Once you have a data model, you can drag fields onto the report canvas to create visuals. A visual is a graphic representation of the data in your model. The following visual shows a simple column chart.

There are many different types of visuals to choose from in Power BI Desktop. To create or change a visual, just select the visual icon from the Visualizations pane. If you have a visual selected on the report canvas, the selected visual changes to the type you selected. If no visual is selected, a new visual is created based on your selection.

To create a new visual, follow the next steps:

  • Choose the appropriate chart from the “Visualizations” pane:


  • Drag your data into the “Axis” and “Value” fields:


  • And you’ll see a chart on your report canvas:


You can customize many of your chart fields and labels in the “Format Pane”:

Learn more at

DSC Resources

Read more…

Why Your Brain Needs Data Visualization

This is a well-known claim nowadays: a goldfish has a higher attention span than the average Internet user. That’s the reason why you’re not interested in reading huge paragraphs of text. Research by the Nielsen Norman Group showed that Internet users have time to read at most 28% of the words on a web page. Most of them read only 20%. Visual content, on the other hand, has the power to hold your attention longer.

If you were just relying on the Internet as a casual user, not reading all the text wouldn’t be a problem. However, when you have a responsibility to process information, things get more complicated. A student, for example, has to read several academic and scientific studies and process a huge volume of data to write a single research paper. 65% of people are said to be visual learners, so they find text difficult to process. The pressing deadline will eventually lead the student to hire the best coursework writing service. If they present the data visually, however, they will need less time to process it and get their own ideas for the paper.

Let’s explore some reasons why your brain needs that kind of visualization.

1.     Visual Data Triggers Preattentive Processing

Our low-level visual system needs only 200-250 milliseconds to accurately detect visual properties. That capacity of the brain is called pre-attentive processing. It is triggered by colors, patterns, and forms. When you use different colors to create a data visualization, you emphasize the important details, so those are the elements your eye will catch first. You will use your long-term memory to interpret that data and connect it with information you already know.

2.     You Need a Visual Tier to Process Large Volumes of Data

When you’re dealing with production or sales, you face a huge volume of data you need to process, compare, and evaluate. If you represented it through a traditional Excel spreadsheet, you would have to invest countless hours looking through the tiny rows of data. Through data visualization, you can interpret the information in a way that makes it ready for your brain to process.

3.     Visual Data Brings Together All Aspects of Memory

The memory functions of our brain are quite complex. We have three aspects of memory: sensory, short-term (also known as working memory), and long-term. When we first hear, see, touch, taste, or smell something, our senses trigger the sensory memory. While processing information, we preserve it in the working memory for a short period of time. The long-term memory function enables us to preserve information for a very long time.

Visual data presentation connects these three memory functions. When we see the information presented in a visually-attractive way, it triggers our sensory memory and makes it easy for us to process it (working memory). When we process that data, we essentially create a new “long-term memory folder” in our brain.

Data visualization is everywhere. Internet marketing experts have understood this, and so have the world's most powerful organizations. It’s about time we started implementing it in our own practice.

Read more…
From time to time I ponder what the future could look like, and I am sure many of us picture the science-fiction image of a future data analyst being handed a pair of holographic gloves to perform three-dimensional analysis. Let us stop daydreaming and get back to basics. Dashboards have come a long way. These days many vendors cater to consuming big data, but I remember the days when the common target of BI vendors was "Excel". The focus has shifted completely from "Excel as the enemy" to "Big Data as the elephant".
Read more…

Originally posted on Data Science Central

This infographic came from Medigo. It displays data from the World Health Organization’s “Projections of mortality and causes of death, 2015 and 2030”. The report details all deaths in 2015 by cause and makes predictions for 2030, giving an impression of how global health will develop over the next 14 years. Also featured is data showing how life expectancy will change between now and 2030.

All percentages shown have been calculated relative to projected changes in population growth.

Read original article here

DSC Resources

Additional Reading

Follow us on Twitter: @DataScienceCtrl | @AnalyticBridge

Read more…

Taxonomy of 3D DataViz

I have been trying to pull together a taxonomy of 3D data viz. The biggest difference, I think, is between allocentric (the data moves) and egocentric (you move) viewpoints. Whether you then view/explore the egocentric 3D visualisation on a 2D screen or in a 3D headset is a lesser distinction (and an HMD is actually possibly less practical in most cases).

We have a related benefits escalator for 2D->3D dataviz, but again I'm not convinced that "VR" should represent another level on it - it's more of an orthogonal element - another way to view the upper tiers.

Care to discuss or extend/expand/improve?

Read more…

3D Data Visualisation Survey

Back in 2012 we released Datascape, a general-purpose 3D immersive data visualisation application. We are now getting ready to release our 2nd generation application – Datascape2XL – which allows you to plot and interact with over 15 million data points in a 3D space, and view them on either a conventional PC screen or an Oculus Rift virtual reality headset (if you must...).

In order to inform our work, we have created a survey to examine the current "state of the market": what applications people are using for data visualisation, how well they are meeting needs, and what users want from a 3D visual analytics application. The survey builds on an earlier survey we did in 2012, the results of which are still available on our web site.

We will again be producing a survey report for public consumption, which you can sign up to receive at the end of the survey, and we'll also post it up here.

The aim of this survey is to understand current use of, and views on, data visualisation and visual analytics tools. We recognise that this definition can include a wide variety of different application types from simple Excel charts and communications orientated infographics to specialist financial, social media and even intelligence focussed applications. We hope that we have pitched this initial survey at the right level to get feedback from users from across the spectrum.

I hope that you can find 5 minutes to complete the survey - which you can find here:




Read more…

Dataviz with Python

This article was written by Reiichiro Nakano.

There are a number of visualizations that frequently pop up in machine learning. Scikit-plot is a humble attempt to provide aesthetically-challenged programmers (such as myself) the opportunity to generate quick and beautiful graphs and plots with as little boilerplate as possible.

Here's a quick example to generate the precision-recall curves of a Keras classifier on a sample dataset:

# Import what's needed for the Functions API
import matplotlib.pyplot as plt
import scikitplot.plotters as skplt

# keras_clf is a Keras classifier; fit it, then generate probabilities on the test set.
keras_clf.fit(X_train, y_train, batch_size=64, nb_epoch=10, verbose=2)
probas = keras_clf.predict_proba(X_test, batch_size=64)

# Now plot.
skplt.plot_precision_recall_curve(y_test, probas)
plt.show()

Installation of the scikit-plot library is simple! First, make sure you have the dependencies Scikit-learn and Matplotlib installed.

Then just run:

pip install scikit-plot

Or if you want, clone this repo and run

python setup.py install

at the root folder.

Originally posted here.



Read more…

Finding insights with graph analytics

Originally posted here


From detecting anomalies to understanding the key elements of a network or highlighting communities, graph analytics reveals information that would otherwise remain hidden in your data. We will see how to use graph analytics with Linkurious Enterprise to detect and investigate insights in your connected data.


What is graph analytics?

Definition and methods


Graph analytics is a set of tools and methods aimed at extracting knowledge from data modeled as a graph. The graph paradigm is ideal for making the best of connected data, whose value resides for the most part in its relationships. But even with data modeled as a graph, extracting knowledge and producing insights can be challenging. Faced with multi-dimensional data and very large datasets, analysts need tools to accelerate the discovery of insights.


The field of graph theory has spawned multiple algorithms that analysts can rely on to find insights hidden in graph data. Below are some of the most popular graph algorithms and how they can help find insights for use cases such as fraud, network management, anti-money laundering, intelligence analysis, or cybersecurity:


  • Pattern matching algorithms identify one or several subgraphs with a given structure within a graph. Example: a company node with the country property containing “Luxembourg” connected to at least five officer nodes with a registered address in France.
  • Traversal and pathfinding algorithms determine paths between nodes within the graph, without knowing in advance what connections exist or how many of them separate the two nodes. In money laundering investigations, path analysis can help determine how money flows through a network of individuals, e.g. how it goes from company A to person B. Example: the shortest path algorithm.
  • Connectivity algorithms find the minimum number of nodes or edges that need to be removed to disconnect the remaining nodes from each other. This is helpful for determining weaknesses in an IT network, for instance, and finding out which infrastructure points are sensitive enough to take it down. Example: the Strongly Connected Components algorithm.
  • Community detection algorithms identify clusters, or groups of nodes, densely connected within the graph. This is particularly helpful for finding groups of people that might belong to a common criminal organization. Example: the Louvain method, the label propagation algorithm.
  • Centrality algorithms determine a node’s relative importance within a graph by looking at how connected it is to other nodes. They are used, for instance, to identify key people within organizations. Example: the PageRank algorithm, degree centrality, closeness centrality, betweenness centrality.
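As an illustrative sketch (a toy graph with the networkx Python library, not the Linkurious/Neo4j stack described in this article), a few of these algorithm families look like this in practice:

```python
import networkx as nx

# Toy network: edges between people/companies in an investigation.
G = nx.Graph([("A", "B"), ("B", "C"), ("C", "D"), ("D", "A"),
              ("D", "E"), ("E", "F")])

# Pathfinding: shortest chain of intermediaries between two nodes.
path = nx.shortest_path(G, "A", "F")

# Centrality: PageRank-style relative importance of each node.
ranks = nx.pagerank(G)
most_central = max(ranks, key=ranks.get)

# Connectivity: minimum set of nodes whose removal disconnects A from F.
cut = nx.minimum_node_cut(G, "A", "F")

print(path, most_central, cut)
```

Each call returns plain Python structures (a list, a dict of scores, a set of nodes), which is why results like these are easy to persist back into a database as node properties.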

Architecture blueprint for graph analytics


Depending on your data, your use-case, and the questions you have to answer, technology and infrastructure can differ from one organization to another. But a generic graph analytics architecture usually consists of the following layers:


  • Linkurious Enterprise: the browser-based platform and its server, used by investigation teams to visualize and analyze graph data. It retrieves data in real time from graph databases.
  • Graph databases: transactional systems storing data as graphs and managing operations such as data retrieval and writing. They handle real-time queries well, making them great online transaction processing (OLTP) systems.
  • Graph processing systems: a set of analytical engines that ship with common graph algorithms and handle large-scale online analytical processing (OLAP) on graphs.

Architecture blueprint for graph analytics


Linkurious Enterprise acts as a front-end where analysts and investigators can easily retrieve information. The data accessed by Linkurious Enterprise is stored in a graph database. Graph databases are well suited for real-time querying and long-term persistence, but are usually not designed for running complex graph algorithms at scale. As a result, our clients tend to push this sort of workload to dedicated graph processing frameworks such as Spark/GraphX. The results are then persisted back into the graph database as new properties (e.g. a PageRank score property) and thus become available to Linkurious Enterprise.

Applying graph analytics to the Paradise Papers data


In this section, we take a closer look at a real-life graph dataset, the Paradise Papers dataset, created by the ICIJ to investigate the world's offshore finance industry. We use Linkurious Enterprise to query, analyze, and visualize the data using graph analytics tools and methods.


The setup


The setup used in our example


For the purpose of this example, we relied on the architecture pictured above:

The Paradise Papers dataset


The dataset is made of 1,582,953 nodes and 2,398,680 edges. It aggregates data from four investigations of the ICIJ: the Offshore Leaks, the Panama Papers, the Bahamas Leaks and the Paradise Papers.


The graph data model has four types of nodes and three types of edges as depicted below.



Graph data model of the Paradise Papers dataset


In the following sections, we will see how to use different graph analytics approaches such as graph pattern matching, PageRank analysis, and the Louvain community detection method. While implementing graph analytics requires some technical knowledge, we will see how Linkurious Enterprise can make graph analytics results accessible to every analyst via simple tools. Among these tools are query templates, an alert dashboard, and a visualization interface.


Graph pattern matching in Linkurious Enterprise


A simple method for identifying patterns in a graph is to use graph languages to describe the shape of the data you are looking for. As a developer, you can do it in the interface of your favorite graph database but also within the Linkurious Enterprise interface.


What if you want to be warned every time a certain graph pattern appears in your data? Via the Linkurious Enterprise alert system, you can set up alerts for graph patterns you want to monitor. Every time a new match is detected in the database, it is recorded and made available for users to review. This is useful in a fraud monitoring context, for instance, where you’d want to be notified when instances of known fraud schemes occur.


In the video below, we set up a new alert in Linkurious Enterprise for a specific pattern. The alert contains a graph query looking for addresses tied to more than five entities or company officers.



Once the alert is saved, users access a match list and can start investigating the results. Below, we review one of the findings from the alert investigation interface. 



When looking at a node representing a company, you may want to know which other companies share the same addresses. The answer can be retrieved manually, by expanding and filtering the data, or via a graph query, which requires technical skills. With Linkurious Enterprise's query templates, you can apply pre-formatted graph queries with the click of a button and accelerate your data exploration. Users run query templates by right-clicking on a node in the visualization and choosing the desired template from the menu.


Below is an example of how to set up a query template. We configure it to retrieve, for a given company officer, all the other officers it is connected to via a shared address or a shared company.



Once the query is configured, users can easily access and run it from the visualization interface to speed up their investigations.



In addition to these features, users can rely on Linkurious Enterprise styling and filtering capabilities to analyze the data faster. Once the results of the query are displayed, styles and filters are essential to refine the results, reduce the noise and highlight the key elements.


In the next section, we see how to automate the identification of unusual companies within the French network using the PageRank algorithm and Linkurious Enterprise’s alert system.


Identifying key nodes with the PageRank algorithm


To use graph algorithms in Linkurious Enterprise, you will first need to run them on your backend and save their results as new properties in your graph database. In this example, we show how to identify key nodes in your network using the PageRank algorithm. This centrality algorithm will compute a score assessing the relative importance of various nodes within a network.

One line of code is enough to run the algorithm in Neo4j and create a new node property, “pagerank_g”, holding the resulting PageRank score.


// Computation of PageRank
CALL algo.pageRank(null, null, {write: true, writeProperty: 'pagerank_g'})


Once this has been added to our graph, we can start exploiting the results in Linkurious Enterprise.

We created a new alert leveraging the PageRank results. The query is simple: it searches for Entity nodes connected to other nodes (Countries, Officer, Intermediary) located in France. It also collects their PageRank scores and ranks them by order of importance. Every matching sub-graph is recorded by the alert system and can be investigated. By sorting results by PageRank score, we can focus our investigation on the most important companies within the French network.


// Detect French entities with a high PageRank


MATCH (a:Entity)-[r]-(b)
WHERE b.countries = 'France'
WITH a.pagerank as score, a, COLLECT(distinct r) as r, COLLECT(distinct b) as b, count(b) as degree
RETURN a, score, as name, r, b, degree


In the example below, we review one of the top matches recorded by the alert system. 



In addition to these features, users can rely on Linkurious Enterprise styling and filtering capabilities to analyze the data faster. For instance, it’s possible to size and filter the nodes based on their PageRank score to get a faster understanding of the situations as depicted in the image below.



A size is applied to “location” nodes based on their PageRank score to highlight nodes of importance.

By enriching the data with additional information, the PageRank algorithm helped us focus on nodes of interest. The alert system in Linkurious Enterprise helps us classify the results and provides a user-friendly interface for investigation. In the next section, we see how to detect communities of interest with a single click using the Louvain algorithm and the query template system.

Identifying interesting communities via the Louvain modularity


In the example below, we implement the Louvain algorithm to identify communities within our network. We look specifically at communities of company officers based on their relationships. The snippet of code below identifies communities and adds a new “communityLouvain” property to each node, representing the community it belongs to.


// Computation of Louvain modularity


CALL algo.louvain(
 'MATCH (p:Officer) RETURN id(p) as id',
 'MATCH (p1:Officer)-[:OFFICER_OF]->(:Entity)<-[:OFFICER_OF]-(p2:Officer)
  RETURN id(p1) as source, id(p2) as target',
 // write each node's community to the "communityLouvain" property
 {graph: 'cypher', write: true, writeProperty: 'communityLouvain'}
)


Then, we leverage the data generated by the algorithm in a query template to retrieve, in a click, for a given “Officer” node, the other officers belonging to the same community. Instead of manually exploring each node's neighbors to identify a potential community, the query template instantly provides an answer that analysts can then refine. Below is the code used in the query template.


//Retrieve the officer nodes who belong to the same community


MATCH (a:Officer)
MATCH p = (a:Officer)-[*..4]-(b:Officer)
WHERE a.communityLouvain = b.communityLouvain
RETURN p


We can now retrieve, in a click, officers of the same community from any given officer in the visualization interface. In the example below, we apply this to Boris Rotenberg, a Russian oligarch, opening an investigation into his close connections. Once the results of the query are displayed, styles and filters are essential to refine them, reduce the noise, and highlight the key elements.



Graph analytics and graph visualization are complementary. The existing graph analytics tools and methods make it possible to extract information from large amounts of connected data, generating valuable insights.


With platforms like Linkurious Enterprise, every user can take advantage of graph analytics from their browser via an intuitive interface. From detecting financial crimes, such as money laundering or tax evasion, to spotting fraud, or fighting organized crime, analysts find the insights they need.


Read more…

Creating a Great Information Dashboard

Our world is dominated by charts and graphs: the news showing economic performance, that annoying friend on social media posting their Strava story for 2017 to show off how far they ran, biked, or hiked, even the infamous maps of the United States showing states colored red and blue that people become obsessed with every four years. Charts and graphs are everywhere. Dashboards are where these charts work together to reach their full potential: multiple related information visualizations that can all be consumed almost simultaneously, without the distraction of scrolling to another part of the window or switching between screens or browser tabs. Dashboards are the pinnacle of information presentation systems supporting organizational decision-making. A well-crafted dashboard makes a decision maker an informed decision maker.
Read more…

Dashboards for Everyone!

No matter the job, most professionals do some level of analysis on their computer. There are always some data sets that live outside the walls, or some analyses that we know could be performed better in a not-easily-sharable tool such as Excel, R, Python, SPSS, SAS, and so on.

So how do you share your personal analysis with others? Oftentimes people export the graphs and tables into a presentation file. One of the largest downfalls of this approach is that it can cause versioning and updating nightmares.

What if I told you that we could avoid all of this with dashboards? Some of you may say, "Yes, obviously, Laura. But I don't have a licensed BI tool or BI experts at my disposal! It's not a realistic scenario for me." In the past, I might've agreed with you. If you don't have a paid BI tool, it can be tricky: free BI tool versions usually require the owner to host the software, or they limit the number of charts, viewers, or users.

However, earlier this year, Google removed a number of restrictions to their free hosted dashboarding software called Google Data Studio.  Because of this, I decided to give the software a test drive and see how accessible it is to the non-BI expert.

Below I will take you through a tutorial I wrote that should allow anyone to create a Google Data Studio dashboard about US home prices. It should take about half an hour of your time. It really is that easy. So please have a try and let me know how it goes!

The Tutorial Description

For this tutorial I wanted to use some sample data to make a basic one-page dashboard. It will feature some common dashboard elements: text, images, summary metrics, summary tables, and maps. To find data, I searched out free data sets and discovered that Zillow offers summary data collected through its real estate business.

Side note: Thank you Zillow, I love when companies share their data! 

I downloaded a number of the data sets that I thought would be interesting to display and did a little data processing to make dashboard creation easier. From there I set out to make a dashboard without reading any instructions, to see how usable it really is. I have to say, it was easy! There are some odd beta-style behaviors that I outline below, but all in all it is a great solution.

The Tutorial Steps

1.  Download the sample data set needed to create the sample.  

Note: if you have trouble downloading the file from github, go to the main page and select "Clone or Download" and then "Download Zip" as per the picture below.

2.  Sign up for Google Data Studio

3.  Click "Start a New Report"

4. In the new report, add the file "Zillow_Summary_Data_2017-06.csv" downloaded as part of the zip file from the data set in step 1.

5.  Modify the columns of the data set to ensure that "State" is of type "Geo">"Region" with no aggregation and the remaining columns are type "Numeric" >"Number" with "Average" as the aggregation.

6.  Click "Add to Report".  This will make the data source accessible to your new report.

Now we are ready to start building the report piece by piece.  To make it easier, I have broken up the dashboard content into 5 pieces that can be added.  We will tackle these one by one.

To add each of the components above, you will need to use the Google Data Studio Toolbar on the top navigation.  The image below highlights each of the toolbar items that we will be using.

7.  "A. Text"- Easy street. Let's add some text to the dashboard. Start by clicking the "Text" button highlighted in the toolbar above.  Next, take the cross-hair and drag it over the space you want the text to occupy. Enter your text: "US Home Prices”.  In the “Text Properties” select the size and type.  I’m using size 72 and type “Roboto Condensed".

8. "B. Image" - Easy street, part 2. Now we simply add a pretty picture to the dashboard. Start by clicking the "Image" button highlighted in the toolbar above. Take the cross-hair and drag it over the space you want the image to occupy. Select the image "houseimage.jpg" that you downloaded from the GitHub repo.

9. "C. Scorecard Values" - Now we get into the real dashboarding exercises through metrics and calculations. Start by clicking the "Scorecard" button highlighted in the toolbar above. Take the cross-hair and drag it over the space you want the first scorecard value to occupy. In the “Data” tab, select the data set and appropriate metric, starting with the values in the image above. In the “Style” tab, select size 36 with the type "Roboto".

Repeat this for every metric in the "C. Scorecard Values" section.

10. "D. Map" - In this step we get more impressive, but not more difficult.  We implement a map! Start by clicking the "Geo Map" button highlighted in the toolbar above. Take the cross-hair and drag it over the space you want the map to occupy.  Select the data set and appropriate metric as per the values in the image above.

11. "E. List"- Now we are going to list out all values in the Geo Map above ordered by their metric "Average Home Value".  Start by clicking the "Table" button highlighted in the toolbar above.  Take the cross-hair and drag it over the space you want the list to occupy. Select the data set and appropriate metric as per the values in the image above.

12.  Make the Report External and Share.  Click the person + icon in the top right of your screen.  Select "Anyone with the link can view".  Copy the external URL and click done.  Now take that external URL and send to all your friends and family with the subject "Prepare to be amazed".

And there you have it, your dashboard is created and you can share away!

Some Criticisms

As I'm sure was obvious from the above, I'm impressed with their offering. But I do feel it is my duty to outline some oddities I came across. For example: when you set up your data source, you need to specify ahead of time, for each column, what type of summary you plan to do with that value. If you want a chart to display averages, you cannot select this dynamically within the chart; it has to be set at the data source. I find this odd and limiting. Additionally, the CSV import has a 200-column limit, and there are some formatting annoyances.

More Details

Google has recently released the ability to embed dashboards!  See the step by step here.

Final Note

I'm happy that I tried out Google Data Studio. While it does not meet my current needs at the enterprise level, I am very impressed by its applicability and accessibility for the personal user. I truly believe that anyone could make a dashboard with this tool. So give it a try, impress your colleagues, and mobilize your analysis with Google Data Studio!

Original document on

Written by Laura Ellis

Read more…

Pictographs are exceptionally good for some types of data. In this post, I show how useful they are for displaying proportions (e.g. rates, percentages, fractions).

Look at the pictograph example on the right. It shows the case fatality rate using colored stick figure icons. These quantities could be just as appropriately shown using pie or bar charts (see above). However, the pictorial representation makes this statistic intuitive: out of every 100 individuals infected with SARS, you can expect 11 to die.

Pictographs have an intrinsic scale

The icons give the pictograph an intrinsic scale. Compare the pictograph (right) to the bar chart (below). Both charts show that SARS is 3 times more deadly than pertussis, but the advantage of using a pictograph can be seen when we compare the other diseases. The pictograph clearly shows that the fatality rate for SARS is an order of magnitude bigger than that for smallpox. By contrast, on the bar chart, all we can see in the absence of any labels is that SARS is much bigger than smallpox.

The finer resolution provided by the icons is especially useful for the smaller values. In the bar chart, the much larger fatality rate of SARS makes the variation between the other diseases hard to see. But in the pictograph, it is clear that the smallpox fatality rate is at least double that of malaria.

Pictographs show quantities visually

A well-designed pictograph makes quantities easy to read. In the example on the right, the small scale and the large number of icons could potentially cause problems. I avoid this by arranging the icons into 10 by 10 squares. Even without explicitly counting each icon, quantities can be evaluated by comparing the area of the square that is red.
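A minimal sketch of this 10-by-10 arrangement (a hypothetical 11% rate, drawn with Matplotlib rather than the tool used for the original chart):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

rate = 11  # hypothetical rate, in percent

# 10x10 grid: the first `rate` cells are "affected" (red), the rest grey.
grid = np.arange(100).reshape(10, 10) < rate

fig, ax = plt.subplots(figsize=(4, 4))
for (row, col), affected in np.ndenumerate(grid):
    ax.add_patch(plt.Rectangle((col, 9 - row), 0.9, 0.9,
                               color="#c0392b" if affected else "#bdc3c7"))
ax.set_xlim(0, 10)
ax.set_ylim(0, 10)
ax.set_aspect("equal")
ax.axis("off")
ax.set_title(f"{rate} in 100")
fig.savefig("pictograph.png")
```

Because the icons sit in a fixed 100-cell square, the red area can be judged at a glance without counting individual cells.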

The example on the right shows data labels in order to provide a greater level of detail. However, the main message of the chart – the enormous difference between the severity of different diseases – is effectively conveyed by the icons alone.

You can create your own pictograph or read more content here.


Data from

Author: Carmen Chan

Carmen is a member of the Data Science team at Displayr. She enjoys looking for better ways to manipulate and visualize data. Carmen studied statistics and bioinformatics at the University of New South Wales.

Read more…

Here we ask you to identify which tool was used to produce the following 18 charts: 4 were done with R, 3 with SPSS, 5 with Excel, 2 with Tableau, 1 with Matlab, 1 with Python, 1 with SAS, and 1 with JavaScript. The solution, including for each chart a link to the webpage where it is explained in detail (many times with source code included) can be found here. You need to be a DSC member to access the page with the solution: you can sign-up here.

How do you score? Would this be a good job interview question?

Chart 1

Chart 2

Chart 3

Chart 4

Chart 5

Chart 6

Chart 7

Chart 8

Chart 9

Chart 10

Chart 11

Chart 12

Chart 13

Chart 14

Chart 15

Chart 16

Chart 17

Chart 18



Read more…

This was originally posted here

Deep Learning is gaining more and more traction. It focuses on one branch of Machine Learning: Artificial Neural Networks. This article explains why Deep Learning is a game changer in analytics, when to use it, and how Visual Analytics allows business analysts to leverage the analytic models built by a (citizen) data scientist.

What are Deep Learning and Artificial Neural Networks?

Deep Learning is the modern buzzword for artificial neural networks, one of many concepts and algorithms in machine learning for building analytic models. A neural network works in a way loosely inspired by the human brain: it maps inputs to outputs through non-linear interactions, continuously learning and accumulating knowledge in computational nodes between input and output. A neural network is in most cases a supervised algorithm that uses historical data sets to learn correlations and predict the outputs of future events, e.g. for cross-selling or fraud detection. Unsupervised neural networks can be used to find new patterns and anomalies. In some cases, it makes sense to combine supervised and unsupervised algorithms.
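As a from-scratch sketch of this idea (NumPy only, a toy network rather than any of the production frameworks discussed below), the classic XOR problem shows a tiny network learning a non-linear input-to-output mapping:

```python
import numpy as np

rng = np.random.default_rng(42)

# XOR: the classic non-linear mapping that no single-layer model can learn.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# One hidden layer of 4 sigmoid units between input and output.
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
sigmoid = lambda z: 1 / (1 + np.exp(-z))

loss_before = np.mean((sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2) - y) ** 2)

for _ in range(5000):
    # Forward pass: input -> hidden -> output.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass: gradient of the squared error, propagated layer by layer
    # (gradient descent with learning rate 1).
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= h.T @ d_out
    b2 -= d_out.sum(axis=0)
    W1 -= X.T @ d_h
    b1 -= d_h.sum(axis=0)

loss_after = np.mean((out - y) ** 2)
```

The "deep" in Deep Learning simply means stacking many more of these hidden layers, which is what modern GPU infrastructure makes computationally feasible.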

Neural networks have been used in research for many decades and include various sophisticated concepts such as the Recurrent Neural Network (RNN), the Convolutional Neural Network (CNN), and the Autoencoder. However, today’s powerful and elastic computing infrastructure, combined with technologies like graphics processing units (GPUs) with thousands of cores, makes it possible to perform much more powerful computations with many more layers. Hence the term “Deep Learning”.

The following picture from TensorFlow Playground shows an easy-to-use environment which includes various test data sets, configuration options and visualizations to learn and understand deep learning and neural networks:

If you want to learn more about the details of Deep Learning and Neural Networks, I recommend the following sources:

  • “The Anatomy of Deep Learning Frameworks” – an article about the basic concepts and components of neural networks
  • TensorFlow Playground, to play around with neural networks yourself, hands-on, without any coding; also available on GitHub so you can build your own customized offline playground
  • The “Deep Learning Simplified” video series on YouTube, with several short, simple explanations of basic concepts, alternative algorithms, and frameworks like TensorFlow

While Deep Learning is getting more and more traction, it is not the silver bullet for every scenario.

When (not) to use Deep Learning?

Deep Learning enables many new possibilities which were not feasible in “mass production” a few years ago, e.g. image classification, object recognition, speech translation, and natural language processing (NLP) in much more sophisticated ways than before. A key benefit is automated feature engineering, which costs a lot of time and effort with most other machine learning alternatives.

You can also leverage Deep Learning to make better decisions, increase revenue or reduce risk for existing (“already solved”) problems instead of using other machine learning algorithms. Examples include risk calculation, fraud detection, cross selling and predictive maintenance.

However, note that Deep Learning has a few important drawbacks:

  • Very expensive, i.e. slow and compute-intensive; training a deep learning model often takes days or weeks, and execution also takes more time than most other algorithms.
  • Hard to interpret: the results of the analytic model lack explainability, often a key requirement for legal or compliance regulations.
  • Tends to overfit, and therefore needs regularization.

Deep Learning is ideal for complex problems, and it can also outperform other algorithms on moderately complex ones. It should not be used for simple problems, however; algorithms like logistic regression or decision trees can solve these more easily and faster.
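The regularization point above can be illustrated with a toy example: a one-dimensional least-squares fit where an L2 penalty (“ridge”) shrinks the learned weight toward zero. The data and penalty value here are made up purely for illustration:

```python
# A minimal sketch of L2 regularization (ridge) on a 1-D least-squares fit,
# using only the standard library. Data and lambda are illustrative.
def fit_1d(xs, ys, l2=0.0):
    """Fit y ~ w*x; l2 > 0 shrinks the weight toward zero (ridge)."""
    num = sum(x * y for x, y in zip(xs, ys))
    den = sum(x * x for x in xs) + l2
    return num / den

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 8.1]           # roughly y = 2x with some noise

w_plain = fit_1d(xs, ys)            # unregularized fit
w_ridge = fit_1d(xs, ys, l2=10.0)   # regularized fit: smaller weight
```

The penalty trades a little bias for robustness, which is exactly the mechanism used to fight overfitting in deep networks (there via weight decay or dropout).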

Open Source Deep Learning Frameworks

Neural networks are mostly adopted via open source implementations; mature deep learning frameworks are available for many programming languages.

The following picture shows an overview of open source deep learning frameworks and evaluates several characteristics:

These frameworks have in common that they are built for data scientists, i.e. personas with experience in programming, statistics, mathematics and machine learning. Note that writing the source code is not a big task: typically, only a few lines of code are needed to build an analytic model. This is completely different from other development tasks like building a web application, where you write hundreds or thousands of lines of code. In Deep Learning – and Data Science in general – it is most important to understand the concepts behind the code to build a good analytic model.
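As a rough illustration of the “few lines of code” point (not taken from any of the frameworks above), here is a complete, if toy, analytic model: a logistic regression trained by gradient descent in plain Python on a made-up data set:

```python
# A complete toy analytic model in a handful of lines: logistic regression
# trained by stochastic gradient descent, no libraries beyond the stdlib.
import math

data = [(0.5, 0), (1.0, 0), (3.0, 1), (4.0, 1)]    # (feature, label) pairs
w, b = 0.0, 0.0
for _ in range(2000):                              # gradient-descent epochs
    for x, y in data:
        p = 1 / (1 + math.exp(-(w * x + b)))       # sigmoid prediction
        w -= 0.1 * (p - y) * x                     # log-loss gradient step
        b -= 0.1 * (p - y)

def predict(x):
    return 1 / (1 + math.exp(-(w * x + b)))
```

The hard part is not this code; it is knowing why a sigmoid, a loss function and a learning rate are the right ingredients, which is the conceptual understanding the paragraph above refers to.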

Some nice open source tools like KNIME or RapidMiner allow visual coding to speed up development and also encourage citizen data scientists (i.e. people with less experience) to learn the concepts and build deep networks. These tools use their own deep learning implementations or embed other open source libraries like H2O or DeepLearning4j under the hood.

If you do not want to build your own model, or want to leverage existing pre-trained models for common deep learning tasks, take a look at the offerings of the big cloud providers, e.g. AWS Polly for text-to-speech translation, Google Vision API for image content analysis, or Microsoft’s Bot Framework to build chat bots. The tech giants have years of experience with analysing text, speech, pictures and videos, and offer that experience as sophisticated analytic models in pay-as-you-go cloud services. You can also improve these existing models with your own data, e.g. train and improve a generic picture recognition model with pictures from your specific industry or scenario.

Deep Learning in Conjunction with Visual Analytics

Whether you use “just” a framework in your favourite programming language or a visual coding tool, you need to be able to make decisions based on the resulting neural network. This is where visual analytics comes into play. In short, visual analytics allows any persona to make data-driven decisions instead of relying on gut feeling when analysing complex data sets. See “Using Visual Analytics for Better Decisions – An Online Guide” to understand the key benefits in more detail.

A business analyst may not understand anything about deep learning, but can still leverage the integrated analytic model to answer business questions. The analytic model is applied under the hood when the business analyst changes parameters, features or data sets. However, visual analytics should also be used by the (citizen) data scientist to build the neural network. See “How to Avoid the Anti-Pattern in Analytics: Three Keys for Machine ...” to understand in more detail how technical and non-technical people should work together using visual analytics to build neural networks that help solve business problems. Even some parts of data preparation are best done within visual analytics tooling.

From a technical perspective, Deep Learning frameworks (and in a similar way any other Machine Learning frameworks, of course) can be integrated into visual analytics tooling in different ways. The following list includes a TIBCO Spotfire example for each alternative:

  • Embedded Analytics: Implemented directly within the analytics tool (self-implementation or “OEM”); can be used by the business analyst without any knowledge about machine learning (Spotfire: Clustering via some basic, simple configuration of an input and output data set plus cluster size)
  • Native Integration: Connectors to directly access external deep learning clusters. (Spotfire: TERR to use R’s machine learning libraries, KNIME connector to directly integrate with external tooling)
  • Framework API: Access via a Wrapper API in different programming languages. For example, you could integrate MXNet via R or TensorFlow via Python into your visual analytics tooling. This option can always be used and is appropriate if no native integration or connector is available. (Spotfire: MXNet’s R interface via Spotfire’s TERR Integration for using any R library)
  • Integrated as Service via an Analytics Server: Connect external deep learning clusters indirectly via a server-side component of the analytics tool; different frameworks can be accessed by the analytics tool in a similar fashion (Spotfire: Statistics Server for external analytics tools like SAS or Matlab)
  • Cloud Service: Access pre-trained models for common deep learning specific tasks like image recognition, voice recognition or text processing. Not appropriate for very specific, individual business problems of an enterprise. (Spotfire: Call public deep learning services like image recognition, speech translation, or Chat Bot from AWS, Azure, IBM, Google via REST service through Spotfire’s TERR / R interface)

All options have in common that you need to configure some hyper-parameters, i.e. “high level” parameters like problem type, feature selection or regularization level. Depending on the integration option, this can be very technical and low level, or simplified and less flexible, using terms the business analyst understands.
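Such a simplified configuration layer might look like the following sketch, where the setting names, choices and numeric values are purely hypothetical:

```python
# Hypothetical mapping from business-friendly terms to low-level
# hyper-parameter values, as a visual analytics tool might expose them.
SIMPLIFIED_SETTINGS = {
    "regularization_level": {"low": 1e-5, "medium": 1e-3, "high": 1e-1},
    "problem_type": {"yes/no question": "binary_classification",
                     "number estimate": "regression"},
}

def resolve(setting, choice):
    """Translate a business analyst's choice into the technical value."""
    return SIMPLIFIED_SETTINGS[setting][choice]
```

The analyst picks “high” regularization; the tool silently passes the corresponding L2 penalty to the underlying framework.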

Deep Learning Example: Autoencoder Template for TIBCO Spotfire

Let’s take one specific category of neural networks as an example: Autoencoders to find anomalies. An autoencoder is an unsupervised neural network that learns to replicate its input by forcing it through restricted hidden layers. A reconstruction error is generated upon prediction: the higher the reconstruction error, the higher the possibility that the data point is an anomaly.
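The reconstruction-error idea can be sketched without any neural network library. In this toy version the “encoder” is hard-coded to project 2-D points onto the line y = x, whereas a real autoencoder would learn its compression from the data:

```python
# Toy reconstruction-error scoring: compress each 2-D point to one number
# (its projection onto the line y = x), reconstruct it, and measure the
# error. A real autoencoder learns this compression; here it is hard-coded.
def reconstruction_error(point):
    x, y = point
    code = (x + y) / 2.0       # encode: 2 numbers -> 1 number
    rx, ry = code, code        # decode: 1 number -> 2 numbers
    return (x - rx) ** 2 + (y - ry) ** 2

normal_points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2)]   # fit the pattern
anomaly = (1.0, 5.0)                                   # does not fit it

normal_errors = [reconstruction_error(p) for p in normal_points]
anomaly_error = reconstruction_error(anomaly)
```

The point that does not fit the learned pattern reconstructs badly, and that high error is exactly how the autoencoder flags it as an anomaly.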

Use Cases for Autoencoders include fighting financial crime, monitoring equipment sensors, healthcare claims fraud, or detecting manufacturing defects. A generic TIBCO Spotfire template is available in the TIBCO Community for free. You can simply add your data set and leverage the template to find anomalies using Autoencoders – without any complex configuration or even coding. Under the hood, the template uses H2O’s deep learning implementation and its R API. It runs in a local instance on the machine where Spotfire runs. You can also take a look at the R code, but this is optional and not needed to use the template.

Real World Example: Anomaly Detection for Predictive Maintenance

Let’s use the Autoencoder for a real-world example. In telco, you have to analyse the infrastructure continuously to find problems and issues within the network, ideally before a failure happens so that you can fix it before the customer even notices the problem. Take a look at the following picture, which shows historical data of a telco network:

The orange dots are spikes which occur as first indication of a technical problem in the infrastructure. The red dots show a constant failure where mechanics have to replace parts of the network because it does not work anymore.

Autoencoders can be used to detect network issues before they actually happen. TIBCO Spotfire uses H2O’s autoencoder in the background to find the anomalies. As discussed before, the source code is relatively short: the analytic model is built with H2O’s Deep Learning R API, and anomalies are detected by computing the reconstruction error of the Autoencoder.

This analytic model – built by the data scientist – is integrated into TIBCO Spotfire. The business analyst is able to visually analyse the historical data and the insights of the Autoencoder. This combination allows data scientists and business analysts to work together fluently. It was never easier to implement predictive maintenance and create huge business value by reducing risk and costs.

Apply Analytic Models to Real Time Processing with Streaming Analytics

This article focuses on building deep learning models with Data Science frameworks and Visual Analytics. Key for success in projects is to apply the built analytic model to new events in real time, adding business value such as increased revenue, reduced cost or reduced risk.

“How to Apply Machine Learning to Event Processing” describes in more detail how to apply analytic models to real time processing. Or watch the corresponding video recording, which leverages TIBCO StreamBase to apply some H2O models in real time. Finally, I recommend learning about the various streaming analytics frameworks available for applying analytic models.

Let’s come back to the Autoencoder use case to realize predictive maintenance in telcos. In TIBCO StreamBase, you can easily apply the built H2O Autoencoder model without any redevelopment via StreamBase’s H2O connector. You just attach the Java code generated by the H2O framework, which contains the analytic model and compiles to very performant JVM bytecode.

The most important lesson learned: think about the execution requirements before building the analytic model. What performance do you need regarding latency? How many events do you need to process per minute, second or millisecond? Do you need to distribute the analytic model to a cluster with many nodes? How often do you have to improve and redeploy the analytic model? You need to answer these questions at the beginning of your project to avoid double efforts and redevelopment of analytic models!
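A minimal harness for answering the latency and throughput questions might look like this sketch; the `score` function is just a stand-in for whatever deployed analytic model you need to benchmark:

```python
# Measure per-event scoring latency and rough throughput with the stdlib.
# `score` is a placeholder for a real deployed analytic model.
import time

def score(event):
    return sum(event) / len(event)     # stand-in for real model execution

events = [[1.0, 2.0, 3.0]] * 1000      # synthetic event stream
latencies = []
for e in events:
    t0 = time.perf_counter()
    score(e)
    latencies.append(time.perf_counter() - t0)

throughput = len(events) / sum(latencies)   # approximate events per second
```

Running such a harness against realistic event volumes early in the project tells you whether the chosen model can meet the latency budget before you invest in training it.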

Another important fact is that analytic models do not always need “real time processing” in the sense of very fast and/or frequent model execution. In the telco example above, spikes and failures might occur days or even weeks apart. Thus, in many use cases it is fine to apply an analytic model once a day or week instead of to every new event every second.

Deep Learning + Visual Analytics + Streaming Analytics = Next Generation Big Data Success Stories

Deep Learning makes it possible to solve many well understood problems, such as cross selling, fraud detection or predictive maintenance, in a more efficient way. In addition, it opens up scenarios that were not possible before, like accurate and efficient object detection or speech-to-text translation.

Visual Analytics is a key component of successful Deep Learning projects. It eases the development of deep neural networks by (citizen) data scientists and allows business analysts to leverage these analytic models to find new insights and patterns.

Today, (citizen) data scientists use programming languages like R or Python, deep learning frameworks like Theano, TensorFlow, MXNet or H2O’s Deep Water and a visual analytics tool like TIBCO Spotfire to build deep neural networks. The analytic model is embedded into a view for the business analyst to leverage it without knowing the technology details.

In the future, visual analytics tools might embed neural network features the way they already embed other machine learning features like clustering or logistic regression today. This will allow business analysts to leverage Deep Learning without the help of a data scientist, at least for simpler use cases.

However, do not forget that building an analytic model to find insights is just the first step of a project. Deploying it to real-time processing is an equally important second step. Good integration between the tooling for finding insights and the tooling for applying insights to new events can significantly improve time-to-market and model quality in data science projects. The development lifecycle is a continuous closed loop: the analytic model needs to be validated and rebuilt at regular intervals.

Read more…

7 Visualizations You Should Learn in R

This blog was originally posted here

With the ever increasing volume of data, it is impossible to tell stories without visualizations. Data visualization is the art of turning numbers into useful knowledge.

R lets you learn this art by offering a set of built-in functions and libraries to build visualizations and present data. Before the technical implementation of the visualization, let’s first see how to select the right chart type.

Selecting the Right Chart Type

There are four basic presentation types:

  1. Comparison
  2. Composition
  3. Distribution
  4. Relationship

To determine which of these is best suited for your data, answer a few questions such as:

  • How many variables do you want to show in a single chart?
  • How many data points will you display for each variable?
  • Will you display values over a period of time, or among items or groups?

Below is a great explanation of selecting the right chart type by Dr. Andrew Abela.

In your day-to-day activities, you will come across the seven charts listed below most of the time.

  1. Scatter Plot
  2. Histogram
  3. Bar & Stack Bar Chart
  4. Box Plot
  5. Area Chart
  6. Heat Map
  7. Correlogram

To learn about the 7 charts listed above, click here. For more articles about R, click here.

Read more…

Kim versus Donald in one Picture

How do you convey a powerful message in just one picture? 

Read more…

Originally posted on Data Science Central

This infographic on Shopper Marketing was created by Steve Hashman and his team. Steve is Director at Exponential Solutions (The CUBE) Marketing. 

Shopper marketing focuses on the customer in and at the point of purchase. It is an integrated and strategic approach to a customer’s in-store experience which is seen as a driver of both sales and brand equity.

For more information, click here


Read more…
