

Time Series Analysis using R-Forecast package

Guest blog post by suresh kumar Gorakala

In today’s blog post, we look at time series analysis using the R package forecast. The objective of the post is to explain the different methods available in the forecast package that can be applied to time series analysis and forecasting.

What is Time Series?

A time series is a collection of observations of well-defined data items obtained through repeated measurements over time. For example, measuring the value of retail sales each month of the year would comprise a time series.


  • Identify patterns in the data – stationarity/non-stationarity. 
  • Prediction from previous patterns.

Time series Analysis in R:

My data set contains sales of cars from Jan 2008 to Dec 2012.

Problem Statement: Forecast sales for 2013

Table: shows the first row data from Jan 2008 to Dec 2012


The forecasts of the timeseries data will be:

Assuming that the data sources for the analysis are finalized and cleansing of the data is done, for further details,

Step1: Understand the data: 

As a first step, understand the data visually. For this purpose, the data is converted to a time series object using ts() and plotted using the plot() function available in R.

ts = ts(t(data[,7:66])) 


Image above shows the monthly sales of an automobile

Forecast package & methods:

The forecast package is written by Rob J Hyndman and is available from CRAN here. The package contains methods and tools for displaying and analyzing univariate time series forecasts, including exponential smoothing via state space models and automatic ARIMA modelling.

Before going into more accurate forecasting functions for time series, let us do some basic forecasts using the meanf(), naive(), and rwf() (random walk with drift) methods. These may not give us accurate results, but we can use them as benchmarks.

All these forecasting functions return objects which contain the original series, the point forecasts, the forecasting method used, and the residuals. The calls below show the three methods and their plots.


mf = meanf(ts[,1],h=12,level=c(90,95),fan=FALSE,lambda=NULL)



mn = naive(ts[,1],h=12,level=c(90,95),fan=FALSE,lambda=NULL) 



md = rwf(ts[,1],h=12,drift=TRUE,level=c(90,95),fan=FALSE,lambda=NULL)
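To make the three benchmarks concrete, here is a minimal base R sketch of the point forecasts they produce. The sales vector is hypothetical and the arithmetic only mirrors the standard definitions of the mean, naive, and drift methods; it is not the package's internal code and it ignores prediction intervals.

```r
# Hypothetical monthly sales figures, used only for illustration.
sales <- c(3, 4, 9, 6, 5, 4, 4, 7, 8, 8, 4, 8)
h <- 3                      # forecast horizon
n <- length(sales)

# Mean method: every future value equals the historical mean.
mean_fc <- rep(mean(sales), h)

# Naive method: every future value equals the last observation.
naive_fc <- rep(sales[n], h)

# Drift method: extend the straight line joining the first and last observations.
slope <- (sales[n] - sales[1]) / (n - 1)
drift_fc <- sales[n] + slope * seq_len(h)

print(rbind(mean = mean_fc, naive = naive_fc, drift = drift_fc))
```

Seeing the benchmarks written out this way makes it clear why they are cheap reference points: none of them estimates anything beyond a mean, a last value, or a single slope.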


Measuring accuracy:

Once the model has been generated, its accuracy can be tested using accuracy(). The accuracy() function returns several error measures, including MASE, which can be used to measure the accuracy of the model. The best model is the one that gives relatively lower values of ME, RMSE, MAE, MPE, MAPE, and MASE in the results below.

> accuracy(md)
                       ME     RMSE      MAE       MPE     MAPE     MASE
Training set 1.806244e-16 2.445734 1.889687 -41.68388 79.67588 1.197689

> accuracy(mf)
                      ME     RMSE      MAE       MPE     MAPE MASE
Training set 1.55489e-16 1.903214 1.577778 -45.03219 72.00485    1

> accuracy(mn)
                    ME    RMSE      MAE       MPE     MAPE     MASE
Training set 0.1355932 2.44949 1.864407 -36.45951 76.98682 1.181666
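As a rough illustration of what these measures mean, the sketch below computes them by hand in base R for the in-sample fit of the mean method. The data are hypothetical, and the MASE scaling shown (the in-sample MAE of one-step naive forecasts) is the usual non-seasonal definition; the exact scaling used by accuracy() may differ between package versions.

```r
# Hypothetical observations; fitted values come from the mean method.
actual <- c(3, 4, 9, 6, 5, 4, 4, 7, 8, 8, 4, 8)
fitted_mean <- rep(mean(actual), length(actual))
err <- actual - fitted_mean

me   <- mean(err)                      # mean error
rmse <- sqrt(mean(err^2))              # root mean squared error
mae  <- mean(abs(err))                 # mean absolute error
mpe  <- mean(100 * err / actual)       # mean percentage error
mape <- mean(abs(100 * err / actual))  # mean absolute percentage error

# MASE: MAE scaled by the in-sample MAE of one-step naive forecasts.
mase <- mae / mean(abs(diff(actual)))

print(round(c(ME = me, RMSE = rmse, MAE = mae, MPE = mpe, MAPE = mape, MASE = mase), 4))
```

A MASE below 1 means the method's errors are, on average, smaller than those of the one-step naive forecast on the same data.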

 Step2: Time Series Analysis Approach:

A typical time series analysis involves the steps below:

  • Check for underlying patterns: stationarity/non-stationarity, seasonality, trend.
  • After the patterns have been identified, apply transformations to the data if needed, based on the seasonality/trends that appear in the data.
  • Use forecast() to predict the future values using a suitable ARIMA model obtained by auto.arima().

Identify Stationarity/Non-Stationarity:

A stationary time series is one whose properties do not depend on the time at which the series is observed. Time series with trends, or with seasonality, are not stationary.

The stationarity/non-stationarity of the data can be checked by applying unit root tests: the augmented Dickey–Fuller (ADF) test and the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test.

ADF: The null-hypothesis for an ADF test is that the data are non-stationary. So large p-values are indicative of non-stationarity, and small p-values suggest stationarity. Using the usual 5% threshold, differencing is required if the p-value is greater than 0.05.


adf = adf.test(ts[,1])


        Augmented Dickey-Fuller Test

data:  ts[, 1]

Dickey-Fuller = -4.8228, Lag order = 3, p-value = 0.01

alternative hypothesis: stationary

The above result (p-value = 0.01, below the 0.05 threshold) suggests that the data is stationary and we can go ahead with ARIMA models.


KPSS: Another popular unit root test is the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test. This reverses the hypotheses, so the null-hypothesis is that the data are stationary. In this case, small p-values (e.g., less than 0.05) suggest that differencing is required.

kpss = kpss.test(ts[,1])

Warning message:

In kpss.test(ts[, 1]) : p-value greater than printed p-value


        KPSS Test for Level Stationarity

data:  ts[, 1]

KPSS Level = 0.1399, Truncation lag parameter = 1, p-value = 0.1


Based on the unit root test results we identify whether the data is stationary or not. If the data is stationary, we choose an optimal ARIMA model and forecast the future intervals. If the data is non-stationary, we use differencing, that is, computing the differences between consecutive observations. Use the ndiffs() function to find the number of differences needed for the data and the diff() function to difference the data.


[1] 1

diff_data = diff(ts[,1])

Time Series:

Start = 2

End = 60

Frequency = 1

 [1]  1  5 -3 -1 -1  0  3  1  0 -4  4 -5  0  0  1  1  0  1  0  0  2 -5  3 -2  2  1 -3  0  3  0  2 -1 -5  3 -1

[36] -1  2 -1 -1  5 -2  0  2 -2 -4  0  3  1 -1  0  0  0 -2  2 -3  4 -3  2  5


Now retest for stationarity by applying the acf() or kpss.test() functions. If the results show stationarity, go ahead and apply ARIMA models.

Identify Seasonality/Trend:

The seasonality in the data can be examined by plotting the decomposition returned by stl().

Stl = stl(ts[,1], s.window="periodic")

series is not periodic or has less than two periods


Since my data doesn’t contain any seasonal behavior I will not touch the Seasonality part.
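One thing worth checking before ruling out seasonality: stl() raises this error for any series whose frequency is below 2, and the differenced output above reports Frequency = 1. The sketch below uses a hypothetical monthly series (not the car sales data) to show the same error with the default frequency and a successful decomposition once frequency = 12 is declared.

```r
set.seed(42)

# Hypothetical monthly series with a yearly cycle, for illustration only.
months <- 1:60
y <- 10 + 3 * sin(2 * pi * months / 12) + rnorm(60, sd = 0.5)

# Without a declared frequency the series has frequency 1, so stl() refuses it.
no_freq <- ts(y)
res <- tryCatch(stl(no_freq, s.window = "periodic"),
                error = function(e) conditionMessage(e))
print(res)

# Declaring monthly data (12 observations per year) lets stl() decompose it.
monthly <- ts(y, start = c(2008, 1), frequency = 12)
fit <- stl(monthly, s.window = "periodic")
print(head(fit$time.series))   # seasonal, trend and remainder components
```

Whether the original data is seasonal can only be judged after it has been given a frequency in this way.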

ARIMA Models:

To forecast stationary time series data we need to choose an optimal ARIMA(p,d,q) model. For this we can use the auto.arima() function, which chooses the optimal (p,d,q) values and returns the fitted model. Know more about ARIMA from here.


Series: ts[, 2]

ARIMA(3,1,1) with drift        


          ar1      ar2      ar3      ma1   drift

      -0.2621  -0.1223  -0.2324  -0.7825  0.2806

s.e.   0.2264   0.2234   0.1798   0.2333  0.1316

sigma^2 estimated as 41.64:  log likelihood=-190.85

AIC=393.7   AICc=395.31   BIC=406.16
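To give a feel for what an automatic order search involves, here is a loose base R sketch that fits a small grid of ARMA(p,q) models with stats::arima() and keeps the one with the lowest AIC. The simulated series, the grid limits, and the fixed d = 0 are all assumptions made for illustration; auto.arima() uses a more refined procedure, so this is an analogy rather than a reimplementation.

```r
set.seed(123)

# Hypothetical stationary series simulated from an AR(1) process.
y <- arima.sim(model = list(ar = 0.6), n = 120)

# Candidate orders; d is held at 0 because the series is assumed stationary.
candidates <- expand.grid(p = 0:2, q = 0:2)

# Fit each candidate and record its AIC (NA if the fit fails).
candidates$aic <- mapply(function(p, q) {
  tryCatch(arima(y, order = c(p, 0, q))$aic,
           error = function(e) NA_real_)
}, candidates$p, candidates$q)

# The preferred model is the one with the smallest AIC.
best <- candidates[which.min(candidates$aic), ]
print(candidates)
print(best)
```

The differencing order is held fixed here because information criteria are not comparable between models fitted to differently differenced data.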


Forecast time series:


Now we use forecast() method to forecast the future events.

forecast(auto.arima(diff_data))
   Point Forecast     Lo 80      Hi 80     Lo 95    Hi 95
61   -3.076531531 -5.889584 -0.2634795 -7.378723 1.225660
62    0.231773625 -2.924279  3.3878266 -4.594993 5.058540
63    0.702386360 -2.453745  3.8585175 -4.124500 5.529272
64   -0.419069906 -3.599551  2.7614107 -5.283195 4.445055
65    0.025888991 -3.160496  3.2122736 -4.847266 4.899044
66    0.098565814 -3.087825  3.2849562 -4.774598 4.971729
67   -0.057038778 -3.243900  3.1298229 -4.930923 4.816846
68    0.002733053 -3.184237  3.1897028 -4.871317 4.876783
69    0.013817766 -3.173152  3.2007878 -4.860232 4.887868
70   -0.007757195 -3.194736  3.1792219 -4.881821 4.866307
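Because the model above was fitted to the differenced series, its point forecasts are month-to-month changes rather than sales levels. A minimal sketch of converting them back, assuming a hypothetical last observed level of 20 (the real value is not shown in this post) and using the first three point forecasts rounded to two decimals:

```r
# Hypothetical last observed sales level; not taken from the original data.
last_level <- 20

# First three forecast changes from the table above, rounded to two decimals.
diff_fc <- c(-3.08, 0.23, 0.70)

# Accumulate the forecast changes on top of the last observed level.
level_fc <- last_level + cumsum(diff_fc)
print(level_fc)
```

Fitting auto.arima() to the undifferenced series avoids this step, since the differencing order d is then handled inside the model.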


The below flow chart will give us a summary of the time series ARIMA models approach:

The above flow diagram explains the steps to be followed for a time series forecasting

Please visit my blog dataperspective for more articles


15 Questions All R Users Have About Plots

Guest blog post by Bill Vorhies

Posted by DataCamp July 30th, 2015.

See the full blog here

R allows you to create different plot types, ranging from the basic graph types like density plots, dot plots, bar charts, line charts, pie charts, boxplots and scatter plots, to the more statistically complex types of graphs such as probability plots, mosaic plots and correlograms.

In addition, R is well known for its data visualization capabilities: it allows you to go from producing basic graphs with little customization to plotting advanced graphs with full-blown customization in combination with interactive graphics. Nevertheless, we do not always get the results that we want for our R plots:

Here’s a quick list of what’s included:

1. How To Draw An Empty R Plot?

  • How To Open A New Plot Frame
  • How To Set Up The Measurements Of The Graphics Window
  • How To Draw An Actual Empty Plot

2. How To Set The Axis Labels And Title Of The R Plots?

  • How To Name Axes (With Up- Or Subscripts) And Put A Title To An R Plot?
  • How To Adjust The Appearance Of The Axes’ Labels
  • How To Remove A Plot’s Axis Labels And Annotations
  • How To Rotate A Plot’s Axis Labels
  • How To Move The Axis Labels Of Your R Plot

3. How To Add And Change The Spacing Of The Tick Marks Of Your R Plot

  • How To Change The Spacing Of The Tick Marks Of Your R Plot
  • How To Add Minor Tick Marks To An R Plot

4. How To Create Two Different X- or Y-axes

5. How To Add Or Change The R Plot’s Legend?

  • Adding And Changing An R Plot’s Legend With Basic R
  • How To Add And Change An R Plot’s Legend And Labels In ggplot2

6. How To Draw A Grid In Your R Plot?

  • Drawing A Grid In Your R Plot With Basic R
  • Drawing A Grid In An R Plot With ggplot2

7. How To Draw A Plot With A PNG As Background?

8. How To Adjust The Size Of Points In An R Plot?

  • Adjusting The Size Of Points In An R Plot With Basic R
  • Adjusting The Size Of Points In Your R Plot With ggplot2

9. How To Fit A Smooth Curve To Your R Data

10. How To Add Error Bars In An R Plot

  • Drawing Error Bars With Basic R
  • Drawing Error Bars With ggplot2
  • Error Bars Representing Standard Error Of Mean
  • Error Bars Representing Confidence Intervals
  • Error Bars Representing The Standard Deviation

11. How To Save A Plot As An Image On Disc

12. How To Plot Two R Plots Next To Each Other?

  • How To Plot Two Plots Side By Side Using Basic R
  • How To Plot Two Plots Next To Each Other Using ggplot2
  • How To Plot More Plots Side By Side Using gridExtra
  • How To Plot More Plots Side By Side Using lattice
  • Plotting Plots Next To Each Other With gridBase

13. How To Plot Multiple Lines Or Points?

  • Using Basic R To Plot Multiple Lines Or Points In The Same R Plot
  • Using ggplot2 To Plot Multiple Lines Or Points In One R Plot

14. How To Fix The Aspect Ratio For Your R Plots

  • Adjusting The Aspect Ratio With Basic R
  • Adjusting The Aspect Ratio For Your Plots With ggplot2
  • Adjusting The Aspect Ratio For Your Plots With MASS

15. What Is The Function Of hjust And vjust In ggplot2?


Here we ask you to identify which tool was used to produce the following 18 charts: 4 were done with R, 3 with SPSS, 5 with Excel, 2 with Tableau, 1 with Matlab, 1 with Python, 1 with SAS, and 1 with JavaScript. The solution, including for each chart a link to the webpage where it is explained in detail (often with source code included), can be found here. You need to be a DSC member to access the page with the solution: you can sign up here.

How do you score? Would this be a good job interview question?

Chart 1

Chart 2

Chart 3

Chart 4

Chart 5

Chart 6

Chart 7

Chart 8

Chart 9

Chart 10

Chart 11

Chart 12

Chart 13

Chart 14

Chart 15

Chart 16

Chart 17

Chart 18

DSC Resources

Additional Reading

Follow us on Twitter: @DataScienceCtrl | @AnalyticBridge


Big Data In Banking [infographic]

The banking industry generates a large volume of data on a day-to-day basis. To differentiate themselves from the competition, banks are increasingly adopting big data analytics as part of their core strategy. Analytics will be the critical game changer for the banks. In this infographic we explore the scale at which banks have adopted analytics in their business.

Originally posted on Data Science Central


Making Data Visualization Work


In commercial terms, how we perceive information determines how we process, interpret and action it. Our brains are wired to process visual information, and understanding how we do this helps us leverage Data Visualization to full effect.

The processes behind Data Analytics mirror the brain’s own functions. Initially your brain processes visual stimulus through the retina, then to the thalamus, then the primary visual cortex and the association cortex. At each stage, there are filters that our brains apply to determine whether this information is relevant enough to continue processing. We call this ‘rubbish in, rubbish out.’ Because it is important for information to be understood quickly and easily, this is where Data Visualization comes into its element.

Using the Right Chart for the Right Job


Pre-attentive processing occurs within the first 200 milliseconds of seeing a visual. Colour, form and pattern are discernible during this phase. This is why spotting a red jelly bean in a bowl of white jelly beans is really easy.

Bar Charts illustrate a snapshot of the information better than line charts, allowing you to make a split second assessment of the value of what you are seeing. Then comes the fun part….using the correct colours to create a story enables you to emphasise information through a universally acknowledged cognitive alphabet. Red…danger, Blue…all is well, Green…growth and action, Yellow…of interest…Pastel Tones are more soothing on the eye…and so forth.

Because we discovered the world through colour and shape, our long term memory allows us to interpret Visual Data with split second clarity.

To get the best use out of colour, when building a Dashboard theme, follow a line of colours around the spectrum for tone on tone harmony.

Memory is the Key to Data Visualization

Psychologists put it like this: we have 3 memory components: Sensory, Working [Short Term] and Long Term. How we use them is based on a push/pull and slow/fast processing system.

Slow processes information in the present. What is 73 x 62? Fast dips into the pre-programmed paradigms and draws fast conclusions based on experience patterns of knowledge. What is 2×2?

The Sensory Register is the component of memory that holds input in its original, unencoded form. Probably everything that your body is capable of seeing, hearing is stored here. In other words, the sensory register has a large capacity; it can hold a great deal of information at one time.

Working Memory [short-term-memory], is the component of memory where new information is held while it is being processed; in other words, it is a temporary “holding bin” for new information. Working memory is also where much of our thinking, or cognitive processing, occurs. It is where we try to make sense of say this blog or solve a problem. Generally speaking, working memory is the component that probably does most of the heavy lifting. It has two characteristics that Data Analytics works around: a short duration span and a limited capacity.

Long-Term Memory is the Hall of Permanent Record. Long-term-memory is capable of holding as much information as an individual needs to store there; there is probably no such thing as a person “running out of room.” The more information already stored in long-term memory, the easier it is to learn new things.

Data Analytics brings together components of memory function and interconnects the relations through holding patterns. So when we process Analytics and Act on them, we essentially create a ‘new permanent record.’

This is what makes us smarter, faster and efficient.

Creating your Optimized Business Candy with Data Visualization

When building an Analytics Dashboard, consider the Story you are trying to build and the Questions you want answered. Encourage discovery of Hidden elements and pull together Relevant information. Present clear Relationships between one data set and another and choose the Correct Chart to illustrate your scenario. Pay particular attention to colour, objects, shapes, patterns and amount of information.

Quick Study: Line or Bar?

  • Line Graphs: Demonstrate the continual nature of data and pattern.
  • Bar Charts: Illustrate value and variables, prominent attributes and ranking.
  • Bar Stacks: Show the values, contribution and ratio in blocks.
  • Percentages: Shift  emphasis from quantity to relative differences.
  • Cumulative: Summarise  all variables along a timeline.


AnyData works with natural brain tech to bring Data and Business together. Get the most out of Data Visualization by visiting our Learning Centre and watching the How To Videos.

Originally posted on Big Data News


Big Data Falls Off the Hype Cycle

Guest blog post by Bill Vorhies

Summary:  Gartner drops “Big Data” from the Hype Cycle for Advanced Analytics and Data Science?  What’s going on?

It is with heavy heart that I must relay to you that Gartner has dropped “Big Data” from its 2015 Hype Cycle for Advanced Analytics and Data Science.  As recently as 2012 this category was called the “The Hype Cycle for Big Data”, but alas, no more.  RIP “Big Data”.

“Big Data” joins other trends dropped into obscurity this year including:  decision management, autonomous vehicles, prediction markets, and in-memory analytics.  Why are terms dropped?  Sometimes because they are too obvious.  For example in-memory analytics was dropped because no one was actually pursuing out-of-memory analytics.  Autonomous vehicles because “it will not impact even a tiny fraction of the intended audience in its day-to-day jobs”.  Some die and are forgotten because they are deemed to have become obsolete before they could grow to maturity.  And Big Data, well, per Gartner “data is the key to all of our discussion, regardless of whether we call it "big data" or "smart data." We know we have to care, so it is moot to make an extra point of it here.”

Is this a joke?  No, not at all.  Actually “Big Data” remains an item in the 2014 Hype Cycle for Emerging Technologies and even the 2015 Hype Cycle for the Internet of Things but that doesn’t really clear things up.  Because right alongside of “Big Data” on those charts are at least a dozen terms drawn directly from Advanced Analytics and Data Science such as predictive analytics, real-time analytics, data science, and content analytics.

OK so there’s plenty of inconsistency among these closely related hype cycles and we’re not about to abandon “Big Data”.  If we did half of us would have to change our company names.  But the truth is that “Big Data” was always a little more than inconvenient.

Start by trying to create a simple definition.  Oh, it’s a little like really big data and a little like really fast data and a little like a whole lot of different types of data.  And that’s only if you stick with the first three Vs.  No wonder our audience is sometimes confused.

When I first took a stab at making a definition I concluded that Big Data was really more about a new technology in search of a problem to solve.  That technology was NoSQL DBs and it could solve problems in all three of those Vs.  Maybe we should have just called it NoSQL and let it go at that.

Not to worry.  I’m sure that calling things “Big Data” will stick around for a long time even if Gartner wants us not to.  There’s an old saying that you die twice.  Once when you pass away and once when the last person who remembers you utters your name.  Based on that criterion I’m guessing “Big Data” is in for a long, long life.


August 17, 2015

Bill Vorhies, President & Chief Data Scientist – Data-Magnum - © 2015, all rights reserved.


About the author:  Bill Vorhies is President & Chief Data Scientist at Data-Magnum and has practiced as a data scientist and commercial predictive modeler since 2001.  Bill is also Editorial Director for Data Science Central.


Cheat Sheet: Data Visualization in Python


Guest blog post by Mirko Krivanek

Below is a Python for Visualization cheat sheet, originally published here as an infographic. Other cheat sheets about Data Science, Python, Visualization, and R can be found here.



Guest infographics, originally posted on Udemy. It explains the concept, alternatives and career advice.


BigQuery Big Data Visualization With D3.js

How to handle large dataset with D3.js?

It’s a frequently asked question. You can read several discussions on the topic here, here, and here. So far, the best solution is to reduce the data to a smaller dataset and then use D3.js to visualize it.

With carefully crafted data processing, we can get a decent story from the data. But this solution doesn’t provide a lot of flexibility to experiment with data on the fly. We need a more streamlined workflow. Less friction can spark interesting data innovation.

Google BigQuery is a great tool for handling big datasets. It’s definitely going to help us handle big datasets for D3.js.

I will use the New York Taxi dataset hosted on Google BigQuery. It is 4+ GB and has more than 350 million rows in 2 tables. In this article, I want to show you how to query it on the fly and then use D3.js to create a line chart of total trip amount over time. You can explore the dataset here:

(You’ll need to set up a BigQuery account with one project to see the public table.)

BigQuery has full SQL support, so we can run an aggregate query directly on the dataset. We’ll group by month/year and sum the total_amount column. It takes less than 5 seconds.

SELECT
  CONCAT(CONCAT(STRING(MONTH(TIMESTAMP(pickup_datetime))), "/"),
         STRING(YEAR(TIMESTAMP(pickup_datetime)))) AS time,
  SUM(INTEGER(total_amount)) AS total_amount
FROM [833682135931:nyctaxi.trip_fare]
GROUP BY time;

Then you can use the JavaScript client library to query your dataset on the fly. See the full article and code here:

Finally, you'll have a visualization that gets data directly from Google BigQuery.

Phuoc Do

Originally posted on Data Science Central


Data Analysis Using Relationship Graphs


There are four key data visualization techniques used by data analysis pros in the government and local law enforcement.  As financial institutions, e-commerce organizations and social network analysts begin to apply data visualization more frequently, these techniques will help guide the process of uncovering meaningful insights hidden within mountains of disparate data.  This post focuses on advanced data visualization using relationship graphs.

In our last post ("Four Key Data Visualization Techniques Used by the Pros"), we mentioned four important techniques in data visualization.  They are:


1)   Data Preparation & Data Connectivity

2)   Data Profiling

3)   Advanced Analysis Using Relationship Graphs

4)   Annotation, Collaboration and Presentation


We summarized the key aspects of data profiling, especially as they relate to uncovering data anomalies prior to advanced analysis.  Using a fraud analysis example, we profiled banking alerts across business lines.   The fraud analyst revealed that specific loan officers were linked to more than one fraud alert.  The alerts also seemed to be concentrated in specific branches.

This post tackles the 3rd phase in the analysis - advanced analysis using relationship graphs.  Unlike traditional forms of business intelligence which usually include summary level charts in a dashboard format, relationship graphs show linkages (relationships) between data entities.   Here's a simple relationship graph from an earlier post that shows linkages between people, flights and addresses:



This graph shows that three different people are linked to one common address at 2911 Major Avenue in Minneapolis.  It shows the flights they took and other addresses with which they are associated.  Using this type of data visualization, intelligence analysts identify important connections between data.  They discover "networks" of people, activity and events.  Additional investigation may include watch list checks, identity verification of people in the network and supplemental data analysis using related information from blogs or news.

Relationship graphs are not only used by government agencies and local law enforcement.   CRM analysts explore product purchasing behavior by customer, type of product, store and region.  Marketers measure lead generation performance by analyzing linkages between key phrases used from the major search engines, web pages, completed web forms, opportunities and closed deals.  Pharmaceutical companies identify influential networks of physicians based on accreditations, hospital affiliations, publications, patients and other attributes.

Returning to our example on Fraud Analysis, let's use this form of advanced analysis to show relationships between banking customers, loan officers, branch affiliation and the address for the property associated with the bank loan.



After filtering the data to analyze just high appraisal alerts, the analyst notices that some customers are linked to properties in states where the loan officer is not affiliated.   For example, Dan Lane owns a property in Washington State.  His loan officer is Charles Head who is assigned to three branches, none of which are in Washington.  Robert Miles has a loan for a property in Maryland with a loan officer (Jack Carnahan) who works in the Los Angeles Branch.  John Kilpatrick (center of the graph) exhibits similar data anomalies.  These types of insights are almost impossible to discern from detailed tables, spreadsheets or charts.   But relationship graphs reveal them instantly.  

Relationship graphs can also be constructed using data driven attributes.  For example, analysts can pinpoint the most connected nodes or the links with the highest value.  When combined with other forms of data visualization, a more detailed picture is revealed.   In the graph below, the loan officers and banking customers are scaled based on the number of connections they have.   The thickness of the links shows the amount of money at risk to the bank.  The timeline on the left shows the length of time between account origination and an alert being triggered.   Since the visualizations interact with one another, the analyst can identify a person of interest in seconds rather than days. 



For example, a short interval alert may correspond to a customer connected to more than one fraud alert.  That customer may be connected to a loan officer who shares connections with other people of interest.  Each of these people may be involved in banking transactions where the money at risk to the bank is significant.

In this case, advanced analysis using relationship graphs has provided a detailed picture of connections that the fraud analyst can use to isolate cases, prioritize resources and investigate at a pace far beyond what would have been possible using traditional forms of business intelligence.   Time saved in this type of analysis can be enormous.   Accurate results are a by-product of the process.

As we will learn in our next two posts, these visualizations are very effective forms of communication allowing analysts to collaborate.   When coupled with the flexibility to integrate other sources of data, relationship graphs can reveal even greater insights. 

This type of analysis has been applied across many domains.  Fraud, Cyber and Intelligence analysis represent three core areas where these techniques have proven useful.  But the applications of relationship graphing extend far beyond these domains.   With the growth of social media, Social Network Analysis (SNA) is becoming more widely adopted to identify important connections, affiliations and spheres of influence across a wide variety of data sets.   At the heart of SNA is the idea that certain people, topics and events are influential within and outside the network.  This same application is being applied to identify and measure other spheres of influence in the life sciences world and social media.   Since a breakdown in one part of the network could negatively impact other parts of the network, the same techniques can be applied in manufacturing, sales and e-commerce.   Some of these important topics will be explored in future posts.

You can learn more about the application of data visualization techniques, please visit or

Originally posted on Data Science Central


Guest blog post by Scott Mongeau

Original blog post on, by Scott Mongeau.


Executive summary

Network analysis offers a new set of techniques to tackle the persistent and growing problem of complex fraud. Network analysis supplements traditional techniques by providing a mechanism to bridge investigative and analytics methods. Beyond base visualization, network analysis provides a standardized platform for complex fraud pattern storage and retrieval, pattern discovery and detection, statistical analysis, and risk scoring. This article gives an overview of the main challenges and demonstrates a promising approach using a hands-on example.

Understanding the problem of fraud detection

With swelling globalization, advanced digital communication technology, and international financial deregulation, fraud investigators face a daunting battle against increasingly sophisticated fraudsters. Fraud is estimated to encompass 5% of the global economy, resulting in an annual loss of more than €2.3 trillion. Further, indications are that fraud is growing in volume, scope, and sophistication.


In an increasingly global and virtual world, the methods for perpetrating fraud are growing in sophistication. As well, fraudsters are increasingly able to collaborate in international rings to perpetrate their schemes and to distribute ill-acquired gains.


A complication in the effort to detect and mitigate complex cases of fraud is the difficulty of smoothly bridging the worlds of forensic investigation and data analytics. Fraud investigators, with deep domain knowledge and street smarts, plough through complex documents, interview parties of interest, and spend time understanding the arcane schemes by which fraudsters attempt to avoid detection.

However, the growing scale of complex fraud means that investigators are increasingly being overwhelmed by volume. Additionally, fraud specialists have deep knowledge of complex fraud cases – tacit knowledge – yet it is difficult to make this knowledge explicit such that it is suitable for efficient sharing. This is especially the case in terms of difficulties in describing complex fraud in terms of patterns suitable for systems-driven detection and analysis.

Meanwhile, data analytics experts gather, transform, and analyse datasets for possible fraud, addressing the challenges of scale and volume via ‘big data’ approaches. Statistical techniques for detecting outliers and algorithmic techniques for identifying suspicious patterns are applied, machine learning for example.

However, fraud detection via advanced analytics typically depends on structured datasets and structured data models. As a result, it is rare that exhaustive datasets are available which encompass all the domains surrounding complex cases of fraud.

As well, machine learning and data mining methods are primarily ‘supervised’, meaning they require training datasets which contain known fraud cases. Sophisticated fraudsters often are knowledgeable concerning automated detection methods and take pains to evade such detection. As a result, complex and ‘innovative’ types of fraud potentially circumvent automated detection when the methods avoid upsetting standard processes (i.e. they leave a seemingly ‘normal’ data trail).

Network analytics for fraud detection

Network analytics is a powerful tool to amplify traditional fraud investigative approaches – a method for cataloguing known, detecting hidden, and discovering new types of fraud. What is principally lacking between the forensics world and the world of data analytics is a transparent, standard language for communicating and searching for complex fraud patterns.

The fraud investigative world deals in rich details and confronts constantly emerging and evolving techniques. The computational world typically communicates in highly structured, abstract datasets and applies analysis via structured datasets. Datasets are often limited in scope and relational database models are slow to accommodate rapidly evolving schemes. Somewhere along the line, the rich complexities of fraud schemes evade both hands-on and automated detection.

This is where network graph analysis is of central value: it offers a method for capturing the rich context of fraud in a standard, machine-readable, and transferable format. Once captured in such a format, deep pattern and statistical analysis can be conducted on existing datasets. Network analytics is thus a complementary approach which enhances and bridges fraud investigatory and data analytics approaches.

The schemes to dodge or exploit taxes are manifold and range from simple to labyrinthine. In particular, enterprises and institutions suffer when complex schemes are systematized at high volume or involve high-value transactions. Sophisticated fraudsters operating at this scale often operate in rings and across borders.

Case in point – EU VAT fraud

As an example, particular markets in the EU are susceptible to cross-border fraud schemes whereby participants seek to avoid value-added-tax (VAT) charges and exploit national tax credits. The amounts are substantial, with some EU countries forgoing or improperly crediting VAT charges as high as 25%. Avoidance and claiming improper credits together systemically cut into national tax revenues across the EU.

Fraudsters are savvy in targeting particular markets and borders, often operating via complex sets of cross-border holding companies and ownership structures. Emerging, unregulated, and highly dynamic markets are particularly at risk, such as those associated with emerging or high-volume specialized commodities. As well, markets which deal in tradable rights or other intangibles are at risk, as they do not leave a physical trail (i.e. lack of witnesses, shipping records, and storage manifests).

Via native network data analysis, such complex fraud schemes can be described in both their general and specific manifestations. As an example, a recognized VAT fraud involves trading international telecommunication rights (the exchange of rights to telecommunication service). The pattern of a particular scheme was translated into a network format and stored in a ‘graph database’ (a native database for storing, managing, and retrieving networked data):


Figure 1: Cross-border EU value-added-tax fraud scheme involving a missing trader and tax credit abuse as encoded in a standard network format with countries denoted (names fictionalized)

The scheme can be summarized as follows:

  1. Southern Europa Telco (3-) buys U.S. phone card rights from two U.S. companies (1- and 2-),
  2. Southern Europa Telco (3-) re-sells within Italy to Joint Bridge Co. (4-) and collects VAT,
  3. Southern Europa Telco (3-) does not pay VAT to Italian tax authorities, instead disappearing with the VAT and becoming a missing trader,
  4. Joint Bridge Co. (4-) resells to Swift Co. (6-) within Italy via parent company Joint IT Group (A-),
  5. Swift Co. (6-) pays VAT to Joint IT Group (A-),
  6. Swift Co. (6-) sells across border to UK Chips Trading Ltd. (7-) and U.S. Nexus Global US Ltd. (11-),
  7. Swift Co. claims VAT credit from Italian tax authorities to offset other international business activities,
  8. Chips Trading Ltd. (7-) sells to Strand VI Co. (9-) in the Virgin Islands via sister firm Chips Global (8-) within Chips UK Group (B-) – this allows Chips UK Group (B-) to claim VAT neutrality,
  9. Strand VI Co. in the Virgin Islands becomes the final recipient of the phone card rights, which can then be recycled to the U.S. Presumably a back-door mechanism exists within the Virgin Islands for participants to share in the benefits: VAT appropriated by missing trader and Italian VAT tax credits.

Recognized schemes, often the result of an intensive fraud investigation, can thus be encoded using a standard format. The pattern can then be used to detect similar transactions in large datasets. However, the Italian national tax authority, absent full details from foreign tax authorities, likely only has insight into a reduced transactional view of this scheme. Namely, only initial transactions across the border and within Italy are likely visible:


   Figure 2:  Cross-border EU VAT tax fraud scheme from the perspective of Italian tax authorities

In this manifestation, it becomes difficult for the Italian tax authorities to apply traditional automated data analytics detection methods (i.e. data mining or machine learning). However, by having documented the full VAT fraud scheme in a network format, characteristic details of the fraud can also be documented. In particular, several unusual aspects of the Italian companies are captured (and can be stored) in the full fraud pattern documented previously:

  1. Transience of the missing trader: the chief hallmark of this fraud pattern is evidence that the missing trader is a ‘front’ – a company set up quickly with the intention of disappearing quickly. Data from the Chamber of Commerce and tax office may show that the company's inception date is close to the initial purchase transaction, triggering an alert. Upon such a warning, forensic investigators can examine additional details to substantiate the company as being ‘at risk’ – for instance, a false or non-answering phone number, an unoccupied address, and/or a ‘fake’ website.
  2. Velocity: for the fraud to operate at a low risk of detection, the entire transaction is likely completed in a relatively compressed period of time (ideally before the missing trader is detected by the tax office) – the short time-span (based on date signatures on the transactions in the data) can be calculated and detected.
  3. Position of the missing trader: the missing trader is the initial purchaser at the border – the entire rapid transaction chain (as per item 2) exiting the country in three steps could be used to trigger an alert to immediately check the validity of the initial purchaser, as per item 1.
  4. Volume and/or scale: for the fraud to be commercially viable, it needs to be conducted either at great volume or scale – indications of multiple transaction chains along the same path in a short time period and/or large transactions are potential alerts to perform the checks in item 1.
  5. Additional data: company ownership by citizens (national citizen number) can be layered onto network data – citizens with ownership stakes in two or more companies in the transaction chain would be considered suspicious, for instance.
  6. Third-party data: data from the police, banks, and credit agencies can be layered onto the network data to identify individuals and companies with a high risk for fraud, and the resulting scores can be used in aggregate to rate a transaction chain as high risk.
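Two of the checks above, the transience of the missing trader (item 1) and shared ownership across the chain (item 5), can be sketched in a few lines of Python. This is only an illustrative sketch: the inception dates, the person-to-company links, the 90-day threshold, and the function names are hypothetical and are not taken from any tax authority's data model.

```python
from collections import defaultdict
from datetime import date

# Hypothetical registry data: company -> inception date
# (e.g. as it might be supplied by a Chamber of Commerce).
inception = {
    "CO3": date(2013, 9, 20),
    "CO4": date(2009, 3, 1),
    "CO6": date(2011, 6, 15),
}

# Hypothetical ownership/director links: (person, company).
links = [("P03", "CO3"), ("P04", "CO4"), ("P04", "CO12"), ("P06", "CO6")]


def is_transient(company, first_trade, max_age_days=90):
    """Flag a company whose inception date is close to its first purchase."""
    age_days = (first_trade - inception[company]).days
    return 0 <= age_days < max_age_days


def shared_persons(links, chain):
    """Return persons linked to two or more companies in the transaction chain."""
    by_person = defaultdict(set)
    for person, company in links:
        if company in chain:
            by_person[person].add(company)
    return {p: sorted(cs) for p, cs in by_person.items() if len(cs) >= 2}


first_trade = date(2013, 10, 1)
chain = {"CO3", "CO4", "CO12", "CO6"}
print(is_transient("CO3", first_trade))  # young company: raises an alert
print(is_transient("CO4", first_trade))  # long-established company: no alert
print(shared_persons(links, chain))      # persons tied to several chain members
```

In practice the thresholds would be tuned by investigators, and an alert would prompt the manual checks described in item 1 rather than serve as proof of fraud.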

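Working with the Neo4j graph database, we can encode such a fraud scheme pattern via a Cypher statement. This pattern represents an approximation of the limited set of transactions visible to the Italian authorities. Note that the CREATE clauses below are meant to be run together as a single statement, so that identifiers such as CO3 refer to the same node throughout: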

CREATE (CO1)-[:SELLS_TO{date: '41548', item_type: 'phone cards rights', epoch: 1380617873, amt: '10000000'}]->(CO3)

CREATE (CO2)-[:SELLS_TO{date: '41548', item_type: 'phone cards rights', epoch: 1380617873, amt: '15000000'}]->(CO3)

CREATE (CO3)-[:SELLS_TO{date: '41557', item_type: 'phone cards rights', epoch: 1381395473, amt: '25000000'}]->(CO4)

CREATE (CO12)-[:SELLS_TO{date: '41562', item_type: 'phone cards rights', epoch: 1381827473, amt: '25000000'}]->(CO6)

CREATE (CO6)-[:SELLS_TO{date: '41567', item_type: 'phone cards rights', epoch: 1382259473, amt: '25000000'}]->(CO7)

CREATE (CO6)-[:SELLS_TO{date: '41572', item_type: 'phone cards rights', epoch: 1382691473, amt: '25000000'}]->(CO11)

CREATE (CO8)-[:SELLS_TO{date: '41577', item_type: 'phone cards rights', epoch: 1383123473, amt: '25000000'}]->(CO9)

CREATE (CO3)-[:COLLECTS_VAT{date: '41557', item_type: 'VAT paid', epoch: 1381395473, amt: '10000000'}]->(CO4)

CREATE (CO12)-[:COLLECTS_VAT{date: '41562', item_type: 'VAT paid', epoch: 1381827473, amt: '10000000'}]->(CO6)

CREATE (CO12)-[:PARENT_OF{date: '', item_type: 'parent company'}]->(CO4)

CREATE (CO12)-[:PARENT_OF{date: '', item_type: 'parent company'}]->(CO5)

CREATE (CO13)-[:PARENT_OF{date: '', item_type: 'parent company'}]->(CO7)

CREATE (CO13)-[:PARENT_OF{date: '', item_type: 'parent company'}]->(CO8)

CREATE (CO14)-[:PARENT_OF{date: '', item_type: 'parent company'}]->(CO10)

CREATE (CO14)-[:PARENT_OF{date: '', item_type: 'parent company'}]->(CO11)

CREATE (P01)-[:DIRECTOR_OF{date: '', item_type: 'director'}]->(CO1)

CREATE (P02)-[:DIRECTOR_OF{date: '', item_type: 'director'}]->(CO2)

CREATE (P03)-[:DIRECTOR_OF{date: '', item_type: 'director'}]->(CO3)

CREATE (P04)-[:DIRECTOR_OF{date: '', item_type: 'director'}]->(CO4)

CREATE (P05)-[:DIRECTOR_OF{date: '', item_type: 'director'}]->(CO5)

CREATE (P06)-[:DIRECTOR_OF{date: '', item_type: 'director'}]->(CO6)

CREATE (P07)-[:DIRECTOR_OF{date: '', item_type: 'director'}]->(CO7)

CREATE (P08)-[:DIRECTOR_OF{date: '', item_type: 'director'}]->(CO8)

CREATE (P09)-[:DIRECTOR_OF{date: '', item_type: 'director'}]->(CO9)

CREATE (P10)-[:DIRECTOR_OF{date: '', item_type: 'director'}]->(CO10)

CREATE (P02)-[:DIRECTOR_OF{date: '', item_type: 'director'}]->(CO11)

CREATE (P04)-[:DIRECTOR_OF{date: '', item_type: 'director'}]->(CO12)

CREATE (P11)-[:DIRECTOR_OF{date: '', item_type: 'director'}]->(CO13)

CREATE (P12)-[:DIRECTOR_OF{date: '', item_type: 'director'}]->(CO14)
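Each transaction in the statements above carries both a 'date' string and an 'epoch' number. The sketch below checks that the two agree, under the assumption (not stated in the original) that 'date' is a spreadsheet-style day serial counted from 1899-12-30 and that 'epoch' is a Unix timestamp in seconds, read here as UTC. The helper names are illustrative.

```python
from datetime import date, datetime, timedelta, timezone


def from_serial(serial):
    """Convert a spreadsheet-style day serial (assumed origin 1899-12-30) to a date."""
    return date(1899, 12, 30) + timedelta(days=int(serial))


def from_epoch(seconds):
    """Convert a Unix timestamp in seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# First and last SELLS_TO transactions from the listing above.
first_sale = ("41548", 1380617873)
last_sale = ("41577", 1383123473)

for serial, epoch in (first_sale, last_sale):
    print(from_serial(serial), from_epoch(epoch).date())

# Elapsed days between the first and last sale, computed from the epochs.
print((from_epoch(last_sale[1]) - from_epoch(first_sale[1])).days)
```

Under these assumptions both fields point to the same calendar days, which suggests that the 'epoch' property is the more convenient one for the time arithmetic used in queries.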

Given this pattern and knowing the tell-tale aspects of the fraud, a query can be developed which will identify a similar pattern in a large set of transactional data. In this example, we would like to identify any sets of cross-border telecommunication rights trades occurring over a short period of time (i.e. less than 15 days) and whereby an intermediary company in the chain of transactions is quite new (i.e. less than 90 days old).

Working with Cypher, we can query a large Neo4j dataset for this specific pattern in tax transactions (thanks to Jean Villedieu for the query design):

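// assumes each Company node carries an epoch property holding its inception time
MATCH p=(a:Company)-[rs:SELLS_TO*]->(c:Company)
// collect the companies along each chain of sales
WITH p, a, c, rs, nodes(p) AS ns
// keep companies less than 90 days old at the reference epoch
// (a list comprehension is used because filter() was removed in Neo4j 4.0)
WITH p, a, c, rs, [n IN ns WHERE 1383123473 - n.epoch < (90*60*60*24)] AS bs
WITH p, a, c, rs, head(bs) AS b
// time elapsed between the first and the last sale in the chain
WITH p, a, b, c, head(rs) AS r1, last(rs) AS rn
WITH p, a, b, c, r1, rn, rn.epoch - r1.epoch AS d
WHERE d < (15*60*60*24)
RETURN a, b, c, d, r1, rn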
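To make the intent of the query easier to follow, the same logic can be sketched in plain Python over the SELLS_TO transactions listed earlier. This is a simplified illustration, not a replacement for the graph query: the inception epoch for CO3 is hypothetical (the listing does not store one), chains are required to contain at least two sales, and chains without a young company are dropped, both of which are assumptions added here.

```python
DAY = 60 * 60 * 24
REFERENCE = 1383123473  # same reference epoch as used in the query

# (seller, buyer, epoch) taken from the SELLS_TO statements in the listing.
SALES = [
    ("CO1", "CO3", 1380617873),
    ("CO2", "CO3", 1380617873),
    ("CO3", "CO4", 1381395473),
    ("CO12", "CO6", 1381827473),
    ("CO6", "CO7", 1382259473),
    ("CO6", "CO11", 1382691473),
    ("CO8", "CO9", 1383123473),
]

# Hypothetical inception epoch (2013-09-21 UTC) for the missing trader.
INCEPTION = {"CO3": 1379721600}


def chains(sales):
    """Yield every directed chain of one or more sales, never reusing a sale."""
    def extend(path):
        yield path
        for sale in sales:
            if sale[0] == path[-1][1] and sale not in path:
                yield from extend(path + [sale])
    for sale in sales:
        yield from extend([sale])


def suspicious(sales, max_span=15 * DAY, max_age=90 * DAY, min_sales=2):
    """Return (companies, first young company, span in days) for fast chains."""
    hits = []
    for path in chains(sales):
        span = path[-1][2] - path[0][2]
        if len(path) < min_sales or not 0 <= span < max_span:
            continue
        companies = [path[0][0]] + [buyer for _, buyer, _ in path]
        young = [c for c in companies
                 if c in INCEPTION and 0 <= REFERENCE - INCEPTION[c] < max_age]
        if young:
            hits.append((companies, young[0], span // DAY))
    return hits


for hit in suspicious(SALES):
    print(hit)
```

A graph database performs the chain expansion natively and at scale; the recursion here only illustrates what the variable-length SELLS_TO match is doing.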

To summarize, having identified the full fraud pattern, we abstracted a version limited to data available to the Italian tax authorities. These details were used to design a specific query which can then identify the fraud pattern in a large set of tax transaction data.

The full fraud pattern, stored in a graph database ‘fraud library’ in an annotated, network-descriptive format, gives tell-tale indications for detection in the smaller pattern-set available to the national tax authorities. This then supports detection in a large set of national tax data.

Beyond visualization: statistical measures

The value of storing fraud schemes as standard patterns in a network format (in a graph database) can be summarized as:

  • standardization without sacrificing detail,
  • ability to communicate patterns between systems transparently,
  • ability to amplify patterns with additional data, and
  • ability to run dynamic network queries on ‘big data’ sets.

However, an additional benefit exists – the ability to characterize statistical measures to empower the discovery of new patterns and automatic pattern detection.

Network science and graph analysis encompass rich, existing fields of study which specify and study recurring patterns and quantitative aspects of networks. Likewise, the social sciences have adopted these principles to study social phenomena via social network analysis (SNA).

Together, these domains observe that all network structures have common patterns, and that these patterns can be studied and quantified. Networks can be measured in terms of hard measures such as reach, clustering or modularity, centrality, and dispersion. Transactions entail steps across a network, and these steps can be scored in terms of ‘weight’, for instance in terms of volume, frequency, speed (over time), amount (monetary), or risk (i.e. in terms of credit risk). Additionally, individuals and companies can be assessed in terms of their relative positions and interactions in a network.

Returning to the VAT fraud example, national tax offices have data concerning company cross-ownership and the association of citizens (via national identification numbers). These details can be used to assess the association of known fraudsters or high-risk individuals with others. Thus, a seemingly ‘clean’ company or individual which transacts frequently or in high amounts between two high-risk entities could be flagged as participating in at-risk transactions. The results can then be used to enhance traditional machine learning detection methods.
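One simple way to express this kind of association scoring is an amount-weighted average of the prior risk of each entity's counterparties. The sketch below is a hedged illustration only: the entity names, amounts, prior risk scores, and the 0.5 thresholds are all hypothetical, and real scoring models would be considerably richer.

```python
from collections import defaultdict

# Hypothetical transactions: (payer, payee, amount).
transactions = [
    ("RiskyA", "CleanCo", 500_000),
    ("CleanCo", "RiskyB", 480_000),
    ("CleanCo", "Supplier", 20_000),
    ("Retailer", "Supplier", 75_000),
]

# Hypothetical prior risk scores in [0, 1], e.g. derived from third-party data.
prior_risk = {
    "RiskyA": 0.9,
    "RiskyB": 0.8,
    "CleanCo": 0.1,
    "Supplier": 0.1,
    "Retailer": 0.1,
}


def association_risk(transactions, prior_risk):
    """Amount-weighted average of the prior risk of each entity's counterparties."""
    weighted = defaultdict(float)
    volume = defaultdict(float)
    for payer, payee, amount in transactions:
        for entity, other in ((payer, payee), (payee, payer)):
            weighted[entity] += amount * prior_risk[other]
            volume[entity] += amount
    return {entity: weighted[entity] / volume[entity] for entity in volume}


scores = association_risk(transactions, prior_risk)

# Entities that look clean on their own but mostly deal with high-risk parties.
flagged = sorted(e for e, s in scores.items() if prior_risk[e] < 0.5 and s > 0.5)
print(flagged)
```

Scores of this kind could be added as features to a conventional machine learning model, which is the enhancement described above.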


Figure 3: Utilizing networked data to establish risk scores for transactions and company associations, which can be used to enhance machine learning approaches

Summary conclusion

The native storage of fraud patterns as network phenomena, and the application of these patterns to fraud detection, is a powerful technique. This approach allows for the composition of ‘fraud libraries’ to capture rich details concerning schemes. Once encoded, tell-tale features of the fraud can be identified to give investigators indications of where to focus automated detection efforts. Additionally, storing and analysing network data leads to new types of indicators via network analysis: statistical measures and the ability to ‘score’ transactions and associations for aggregate risk. At the cutting edge, data on networks can be examined and simulated over time to gain new insights into how markets and transactions are evolving in character – a foundation for strategy formation and proactive preparedness.


The 5 V's of Big Data by Bernard Marr

Guest blog post by Bernard Marr

A nice infographic produced by the well-known business management consultant and author Bernard Marr. Click on the picture, then click on it once more, to see an easy-to-read version.

DSC Resources

Additional Reading

Follow us on Twitter: @DataScienceCtrl | @AnalyticBridge


Originally posted on Big Data News

With the growing amount of information being created daily, big data is providing marketers a means to find consumers who are in-market for the products or services their company sells. However, since this data is growing so rapidly and is stored in so many places (e.g. social networks, blogs, forums, public records, databases), it is nearly impossible for marketers to quickly find all the data they need on in-market prospects without expertise and tools.

I created an infographic that briefly explains how leading marketers are using a service such as Data-as-a-Service (DaaS) to find and target prospects who are currently in-market for products and services their company sells. You can view the infographic below:


3 Ways to Get Your Data Into Shape [Infographic]

Originally posted on Data Science Central

As data continues to grow at unprecedented speeds, organizations must embrace a data-driven mind-set to stay competitive. With the influx of bigger data and new types of data, companies of all sizes are increasingly dependent on large sets of information to make better business decisions. 

Marketing departments can rely on data to discover up-sell and cross-sell opportunities and to improve customer relationships. According to The 2014 Digital Marketer: Benchmark and Trend Report, 93 percent of organizations think data is essential to their marketing success.

For marketers to draw accurate insights, data must be accurate and clean. Unfortunately, poor data quality costs companies millions of dollars each year due to lost revenue opportunities, wasted marketing expenses, failure to meet regulatory compliance or failure to address customer issues in a timely manner.

For data to be of value it must be in tip-top shape. Here are three ways to get your data into shape:

Getting your data into shape is a very attainable goal for those companies that are willing to invest the time and resources. And for those companies that do? Companies that put data at the center of their marketing decisions improve their marketing return on investment (MROI) by 20%. (McKinsey)


Space is limited.

Reserve your Webinar seat now

In 1914, New Yorker Willard Brinton wrote Graphic Methods for Presenting Facts, the first book on telling stories through data and communicating information visually. Today, the volume of data in the world is exponentially increasing, the tools to transform analysis into stories are evolving—and 100 years later, Brinton’s lessons still hold true.

In this next DSC webinar event, we will explore:

  • Visualization basics that withstand the test of time
  • The right charts for telling the right stories
  • Brinton’s checklist for communicating data

Speaker: Andy Cotgreave, Senior Technical Evangelist Manager -- Tableau Software

Hosted by Bill Vorhies, Editorial Director -- Data Science Central

Again, space is limited, so please register early:

Reserve your Webinar seat now


After registering you will receive a confirmation email containing information about joining the Webinar.

Big data and the retail industry: infographics

Submitted by ColourFast. Enjoy!


Guest blog sent by Veronica Johnson

In this day and age of big data and information overload, data visualizations are, hands down, the most effective way of filtering out and presenting complex data.

A single data visualization graphic can be priceless. It can save you hours of research. Visualizations are easy to read and interpret and, if based on the right sources, accurate as well. And with the highly social nature of the web, the data can be lighthearted, fun, and presented in so many different ways.

What’s most striking about data visualizations though is that they aren’t as modern a concept as we tend to think they are.

In fact, they go back more than 2,500 years, long before computers and tools for easy visual representation of data even existed.

Curious to see how data visualizations developed over time?

Below is an infographic that highlights 11 unique data visualizations from different, yet significant, periods in history. It includes the first world map created by Anaximander, the elaborate Catalan atlas commissioned by King Charles V of France, Dr. John Snow’s map of cholera deaths in London that helped in combating the disease in the second half of the 19th century, and so on.

From ancient Greece and Medieval France to Victorian England and 19th century Sweden, these data visualizations and their creators were ahead of their time, innovating the way in which information could be presented.

Whether they knew it or not at the time, these creators helped to develop an essential modern-day tool that is now invaluable to the world of statistics. Take a look.

Originally posted on Data Science Central
