
Originally posted on Data Science Central

1 The Tragedy

As most of you know, the country of Nepal has been hit with a powerful earthquake, killing over 1,800 people (as of this writing). Thousands are injured. Lives are disrupted. Historic temples, monuments, and buildings have been leveled.

I plead with you to help however you can. Money. Time. Good wishes, thoughts, and prayers. Scroll to the end of this post for where you can help. Thank you, and enjoy the post.

2 Mapping & Visualizing the Nepal Earthquake

For this analysis, I scrape the data from this website, which has the latest earthquake data worldwide.

Let’s look at the first 5 rows of this dataset. We see that this data set has the Date (down to the hour, minute, and second) that each earthquake occurred, the geo-location (latitude & longitude), Magnitude, the country/location of origin, and other items. Please note that, as usual, there was some data munging to get the data into a usable format, since it comes from a scraped HTML web table.

##                   Date    LAT     LON Magni DepthKM
## 1 26-APR-2015 14:05:31 -19.09 -176.30   4.6     298
## 2 26-APR-2015 13:11:15  27.76   85.37   4.5      10
## 3 26-APR-2015 12:55:00  -1.94  100.20   4.8      57
## 4 26-APR-2015 12:12:07 -31.66 -177.54   4.7      41
## 5 26-APR-2015 10:38:04 -11.03  120.67   4.8      35
##                            Map EventID            loc
## 1          FIJI ISLANDS REGION 5111954 -19.09:-176.30
## 2                        NEPAL 5111951    27.76:85.37
## 3 SOUTHERN SUMATERA, INDONESIA 5111938   -1.94:100.20
## 4      KERMADEC ISLANDS REGION 5111936 -31.66:-177.54
## 5    SOUTH OF SUMBA, INDONESIA 5111933  -11.03:120.67

But we are only interested in Nepal earthquakes so let’s filter to just Nepal.

##         Date   LAT   LON Magni DepthKM   Map EventID         loc
## 1 2015-04-26 27.76 85.37   4.5      10 NEPAL 5111951 27.76:85.37
## 2 2015-04-26 27.84 85.69   4.5      10 NEPAL 5111927 27.84:85.69
## 3 2015-04-26 27.59 85.68   4.6      10 NEPAL 5111926 27.59:85.68
## 4 2015-04-26 27.72 85.93   4.7      10 NEPAL 5111924 27.72:85.93
## 5 2015-04-26 27.71 85.83   5.0      10 NEPAL 5111923 27.71:85.83
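For readers who want to reproduce the scrape-and-filter step, here is a minimal sketch using Python and pandas (an assumption on my part; the original analysis was done in R, and the source URL is not reproduced here):

```python
import pandas as pd

url = "..."  # URL of the worldwide earthquake listing page (not given here)

# read_html returns every <table> found on the page; assume the first one
# is the earthquake table (an assumption about the page layout)
quakes = pd.read_html(url)[0]
quakes.columns = ["Date", "LAT", "LON", "Magni", "DepthKM", "Map", "EventID"]

# Keep only the Nepal events and add a combined lat:lon key, as in the output above
nepal = quakes[quakes["Map"].str.contains("NEPAL", na=False)].copy()
nepal["loc"] = nepal["LAT"].astype(str) + ":" + nepal["LON"].astype(str)
print(nepal.head())
```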

This data, while interesting, needs to be visualized in a more easily consumable format. The best way is using a map. Here’s a visualization of all the earthquake spots according to latitude and longitude with the magnitudes of the earthquakes in red tooltip pins.
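As a rough idea of how such a map could be built in Python, here is a hedged sketch using folium (the original post used a different mapping tool; the nepal DataFrame comes from the scraping sketch above):

```python
import folium

# Center the map roughly on Kathmandu
quake_map = folium.Map(location=[27.7, 85.3], zoom_start=7)

for _, row in nepal.iterrows():
    folium.CircleMarker(
        location=[row["LAT"], row["LON"]],
        radius=float(row["Magni"]),          # marker size scales with magnitude
        color="red",
        fill=True,
        popup=f"M{row['Magni']} at {row['Date']}",
    ).add_to(quake_map)

quake_map.save("nepal_quakes.html")
```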


We see that most of the earthquakes were centered around Kathmandu, where most of the lives have been lost. Additionally, there were other quakes in more remote regions. This is a good data visualization, but it does NOT tell the entire story. If we want to see how the earthquakes unfolded over time, we need a time series graph.

3 Hour by Hour Visualization of Earthquake Magnitudes

Here we see that the first earthquake happened around 6:11am with a magnitude of 7.8. This is the number that’s been quoted in the press. However, interestingly enough, there have been over 30 more aftershocks of varying magnitudes, even into the next day on April 26, 2015. These aftershocks are compounding the lives lost, the property destruction, and the human misery.
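A minimal sketch of this hour-by-hour view, again in Python rather than the original R (it assumes the nepal DataFrame from the scraping sketch earlier):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Parse the timestamps and sort chronologically
nepal["Date"] = pd.to_datetime(nepal["Date"])
nepal = nepal.sort_values("Date")

plt.figure(figsize=(10, 4))
plt.plot(nepal["Date"], nepal["Magni"], "o-")
plt.xlabel("Time (UTC)")
plt.ylabel("Magnitude")
plt.title("Nepal earthquake and aftershocks, April 25-26, 2015")
plt.tight_layout()
plt.show()
```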

4 How you can help

I’d like to end this blog with a humanitarian plea. This is the human side of data science. As data scientists, we have lots of fun with data, numbers, visualizations, models, algorithms, machine learning, etc. But we must also have heart and compassion for our fellow human beings, especially those who are suffering in such a horrific natural disaster.

There are many ways to help. Donate money, time, or just send well wishes, thoughts, and prayers.

Here are two fine organizations that are leading the relief efforts:

  1. World Vision
  2. Red Cross

There are others as well.

I’m positive that Nepal and its people will band together and rebuild. Humans always have. The good human spirit is the strongest force in the universe.

Thank You.

Read more…

Fraud detection in retail with graph analysis

Guest blog post by Jean Villedieu

Fraud detection is all about connecting the dots. We are going to see how to use graph analysis to identify stolen credit cards and fake identities. For the purpose of this article we have worked with Ralf Becher of irregular.bi. Ralf is a Qlik Luminary and he provides solutions to integrate the graph approach into Business Intelligence solutions like QlikView and Qlik Sense.

Third party fraud in retail

Third party fraud occurs when a criminal uses someone else’s identity to commit fraud. For a typical retail operation, this takes the form of individuals or groups of individuals using stolen credit cards to purchase high-value items.

Fighting it is a challenge. In particular, it means having a capability to detect potential fraud cases in large datasets and a capability to distinguish between real cases and false positives (the cases that look suspicious but are legitimate).

Traditional fraud detection systems focus on thresholds related to customer activities. Suspicious activities include, for example, multiple purchases of the same product, or a high number of transactions per person or per credit card.

Graph analysis can add an extra layer of security by focusing on the relationships between fraudsters or fraud cases. It helps identify fraud cases that would otherwise go undetected…until it is too late. We recently explained how to use graph analysis to identify stolen credit cards.

For this article, we have prepared a dummy dataset typical of an online retail operation. It includes:

  • order details: product, amount, order-id, date;
  • personal details: first name, last name;
  • contact info: phone, email;
  • payment: credit card;
  • shipping: address, zip, city, country;
  • tracking: IP address.

To analyse the connections in our data, we stored it in Neo4j, the leading graph database. The graph approach lies in modelling data as nodes and edges. Here is a schema of our data represented as a graph:

Graph data model

You can download the data here.

Finding suspicious transactions

Now that the data is stored in Neo4j, we can analyse it.

First of all we need to set a benchmark for what’s normal. Here is an example of a transaction:

Example of a legitimate account

Now that we have an idea of what not to look for, we can start thinking about patterns specifically associated with fraud. One such pattern is a personal piece of information (IP, email, credit card, address) associated with multiple persons.

Neo4j includes a graph query language called Cypher that allows us to detect such a pattern. Here is how to do it:

//-----------------------
//Detect fraud pattern
//-----------------------
MATCH (order:Order)<-[:ORDERED]-(person:Person)
MATCH (order)-[]-(fact)
WITH fact, collect(order) as orders, collect(distinct person) as people
WHERE size(orders) > 1 and size(people) > 1
RETURN fact, orders, people
LIMIT 20

What this query does is search for shared personal pieces of information. It returns all groups of at least two persons and two orders connected by a common piece of personal information.
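If you want to run this query programmatically, here is a hedged sketch using the official Neo4j Python driver (the driver choice, connection URI and credentials are my assumptions; the article itself only shows the Cypher):

```python
from neo4j import GraphDatabase

# Hypothetical connection details; adjust to your own Neo4j instance
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

query = """
MATCH (order:Order)<-[:ORDERED]-(person:Person)
MATCH (order)-[]-(fact)
WITH fact, collect(order) as orders, collect(distinct person) as people
WHERE size(orders) > 1 and size(people) > 1
RETURN fact, orders, people
LIMIT 20
"""

with driver.session() as session:
    for record in session.run(query):
        # Each record bundles one shared fact with the orders and people touching it
        print(record["fact"], len(record["orders"]), len(record["people"]))

driver.close()
```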

To verify the accuracy of our query, fine-tune it or evaluate how to act on the alerts it returns, we will use graph visualization.

Case#1: multiple people sharing the same email

The address edmund@gmail.com (center) is shared by 3 people (purple nodes)

Here we can see that 3 persons are sharing the same email. Are we looking at potential fraud? If we expand the graph, we can see that the 3 persons have distinct addresses, IPs, phones and credit cards.

Data associated with the 3 distinct people using edmund@gmail.com

In isolation, each of these persons looks normal. Edmund Cagliostro, for example, seems like a legitimate customer.

Details of Edmund Cagliostro

The fact that these seemingly distinct accounts share a common email address is suspicious. It justifies further investigation of Edmund Cagliostro and his connections.

Case#2: multiple people using the same IP address

Our query also reveals an IP address shared by multiple persons.

An IP address (center) with connections to 5 persons (purple) and orders (orange)

We can see that IP address 0.106.244.75 is shared by 5 people. Once again this is suspicious and should be investigated.

Graph visualization can help us inspect potential fraud cases and quickly evaluate them.

Identifying a ring of fraudsters

Now that we have found a couple of suspicious fraud cases, it’s time to dig deeper. We want to assess the full impact of an individual fraud to take appropriate actions.

Let’s say we noticed in our dummy dataset that a “Leisa Gugliotta” is involved in a fraud. Not only do we want to block any transactions from her but we also need to identify her potential accomplices. In order to do that, we need to see who else is using the personal information used by Leisa Gugliotta.

Here is how to do that via Cypher:

//-----------------------
//Who are Leisa's accomplices?
//-----------------------
MATCH (suspect:Person {full_name:"Leisa Gugliotta"})
MATCH (fact)<-[:USED_EMAIL|:USED_PHONE|:USED_IP|:USED_CREDIT_CARD|:USED_ADDRESS]-(suspect)
MATCH (fact)<-[:USED_EMAIL|:USED_PHONE|:USED_IP|:USED_CREDIT_CARD|:USED_ADDRESS]-(other)
WHERE suspect <> other
RETURN suspect, other, collect(distinct fact) as facts
LIMIT 20

We can run the same analysis via Linkurious. The result is the following graph:

The people involved in the fraud ring led by Leisa Gugliotta

This picture makes it easy to see that our retail operation has been targeted by a fraud ring. Leisa Gugliotta shares a credit card with one other person and an email address with 4 people. These fraudsters can all be identified by the connections between them. Now we can freeze their accounts and add their information to our blacklist.


Third party fraud means that personal pieces of information are reused to create fake identities (known as synthetic identities). Graph analysis makes it possible to spot that pattern and prevent fraud. Through graph visualization, we can quickly evaluate potential fraud cases and make informed decisions. Try Linkurious now to learn more!

Read more…

Guest blog post by Doug Needham

How does centrality affect your Architecture?

Some time ago, I was responsible for a data architecture I had mostly inherited. There were a number of tweaks I worked on to refine the monolithic nature of the main database. It was a time of upheaval in this organization. They had outgrown their legacy Computer Telephony Interface application. It was time to create something new.
A large new application development team was brought in to develop some new software.
There was a large division of labor and processing: some things were handled by the new application, and another system was developed to handle the data. Reporting, cleansing, analysis, ingress feeds, egress feeds, all of these went through the “less important” system.
This was the system I was responsible for. 
In thinking about how best to explain a Data Structure Graph, I spent some time revisiting this architecture and brought it into a format that could be analyzed with the tools of Network Analysis. 
After anonymizing the data a bit, and limiting the data flows to only the principal data flows, I constructed a csv file to load into Gephi for analysis.

Source,Target,Edge_Label
Spider,ODS,Application
ODS,Spider,Prospect
Vendor1,ODS,Prospect
Vendor2,ODS,Prospect
Vendor3,ODS,Prospect
ODS,Servicing,Application
Legacy,ODS,Application
ODS,Legacy,Prospect
ODS,Dialer1,Prospect
ODS,Dialer2,Prospect
Gov,ODS,DNC
ODS,Spider,LegacyData1
ODS,Spider,LegacyData2
ODS,Spider,LegacyData3
Spider,ODS,LegacyData1
Spider,ODS,LegacyData2
Spider,ODS,LegacyData3
ODS,ThirdParty,Prospect
ThirdParty,ODS,Application
Legacy,ODS,Application
Legacy,ODS,DialerStats
Dialer1,ODS,DialerStats
Dialer2,ODS,DialerStats

I ran a few simple statistics on the graph, then did some partitioning to color the graph by node degree. This is the first output of Gephi:
The actual statistics Gephi calculated are in this table:

Id         | Label      | PageRank   | Eigenvector Centrality | In-Degree | Out-Degree | Degree
Vendor1    | Vendor1    | 0.01991719 | 0.00000000             | 0         | 1          | 1
Vendor2    | Vendor2    | 0.01991719 | 0.00000000             | 0         | 1          | 1
Vendor3    | Vendor3    | 0.01991719 | 0.00000000             | 0         | 1          | 1
Gov        | Gov        | 0.01991719 | 0.00000000             | 0         | 1          | 1
Spider     | Spider     | 0.08121259 | 0.44698155             | 1         | 1          | 2
Servicing  | Servicing  | 0.08121259 | 0.44698155             | 1         | 0          | 1
Legacy     | Legacy     | 0.08121259 | 0.44698155             | 1         | 1          | 2
Dialer1    | Dialer1    | 0.08121259 | 0.44698155             | 1         | 1          | 2
Dialer2    | Dialer2    | 0.08121259 | 0.44698155             | 1         | 1          | 2
ThirdParty | ThirdParty | 0.08121259 | 0.44698155             | 1         | 1          | 2
ODS        | ODS        | 0.43305573 | 1.00000000             | 9         | 6          | 15

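For readers who prefer code to a GUI, here is a hedged sketch that reproduces roughly the same metrics with Python and networkx (the library choice and file name are my assumptions; networkx also normalizes eigenvector centrality differently than Gephi, so the absolute values will differ even though the ranking stays the same):

```python
import csv
import networkx as nx

G = nx.DiGraph()
with open("data_structure_graph.csv") as f:
    reader = csv.reader(f)
    next(reader)                                   # skip the Source,Target,Edge_Label header
    for source, target, label in reader:
        G.add_edge(source, target, label=label)    # parallel edges collapse into one

pagerank = nx.pagerank(G)
eigen = nx.eigenvector_centrality(G.to_undirected())

for node in sorted(G.nodes, key=pagerank.get):
    print(f"{node:12s} in={G.in_degree(node)} out={G.out_degree(node)} "
          f"pagerank={pagerank[node]:.4f} eigenvector={eigen[node]:.4f}")
```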
From the Data Architecture perspective, which “application” has the greatest impact on the organization if there were a failure?
Which “application” should have the greatest degree of protection, redundancy, and expertise associated with it?
Let's cover in detail the two metrics in the middle of the table above: PageRank and Eigenvector Centrality.

I will have to create individual blog entries for both PageRank and Eigenvector Centrality to discuss the actual mechanism for how these are calculated. The math for these can be a bit cumbersome, and each algorithm should be given due attention on its own.

The point of this analysis is to determine which component of the architecture should have additional resources devoted to it. Any customer-facing application should be given due attention and infrastructure. However, one question I have seen many of my clients struggle with is: what is the priority of the back-end infrastructure? Should one component of the architecture be given more attention than another? If I have 90 databases throughout the organization, which one is the most important?

These centrality calculations show unequivocally which component of the architecture has the most impact in the event of an outage, or where the most value can be provided for an upgrade.
  
This type of analysis can begin to shed light on the answers to these questions. A methodical approach to an architecture based on data, rather than on the division that screams the loudest, can give insight into how an architecture is truly implemented.

I call these artifacts a Data Structure Graph.
Read more…

Python Visualization Libraries List

Originally posted on Data Science Central

ggplot

ggplot is a plotting system for Python based on R's ggplot2 and the Grammar of Graphics. It is built for making professional-looking plots quickly with minimal code.

 

Seaborn

Seaborn is a library for making attractive and informative statistical graphics in Python. It is built on top of matplotlib and tightly integrated with the PyData stack, including support for numpy and pandas data structures and statistical routines from scipy and statsmodels.

Seaborn offers:

  • Several built-in themes that improve on the default matplotlib aesthetics

  • Tools for choosing color palettes to make beautiful plots that reveal patterns in your data

  • Functions for visualizing univariate and bivariate distributions or for comparing them between subsets of data

  • Tools that fit and visualize linear regression models for different kinds of independent and dependent variables

  • Functions that visualize matrices of data and use clustering algorithms to discover structure in those matrices

  • A function to plot statistical timeseries data with flexible estimation and representation of uncertainty around the estimate

  • High-level abstractions for structuring grids of plots that let you easily build complex visualizations
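As a quick taste of the API, here is a minimal Seaborn sketch (it assumes the “tips” example dataset that ships with the library):

```python
import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")            # small example dataset bundled with seaborn
sns.set_theme(style="whitegrid")           # one of the built-in themes
sns.lmplot(data=tips, x="total_bill", y="tip", hue="smoker")  # regression fit per subset
plt.show()
```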

 

matplotlib

matplotlib is a Python 2D plotting library which produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms. You can generate plots, histograms, power spectra, bar charts, error charts, scatter plots, etc., with just a few lines of code.
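A minimal sketch illustrating the “few lines of code” claim (the data is random and purely illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 200)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3))
ax1.plot(x, np.sin(x), label="sin(x)")      # simple line plot
ax1.legend()
ax2.hist(np.random.randn(1000), bins=30)    # histogram of random data
fig.savefig("example.png", dpi=150)         # hardcopy output
```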

 

Bokeh

Bokeh is a Python interactive visualization library that targets modern web browsers for presentation. Its goal is to provide elegant, concise construction of novel graphics in the style of D3.js, but also deliver this capability with high-performance interactivity over very large or streaming datasets. Bokeh can help anyone who would like to quickly and easily create interactive plots, dashboards, and data applications.
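A minimal Bokeh sketch that renders an interactive scatter plot to a standalone HTML page (again with random, purely illustrative data):

```python
import numpy as np
from bokeh.plotting import figure, output_file, show

output_file("bokeh_scatter.html")
p = figure(title="Interactive scatter", tools="pan,wheel_zoom,box_zoom,reset")
p.scatter(np.random.rand(200), np.random.rand(200), size=6, alpha=0.6)
show(p)   # opens the HTML page in a browser
```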

 

pygal

pygal is a dynamic SVG charting library. It features various graph types like bar charts, line charts, XY charts, pie charts, radar charts, dot charts, pyramid charts, funnel charts, and gauge charts. It supports CSS styling with pre-defined themes.
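A minimal pygal sketch that renders a bar chart to a standalone SVG file (the numbers are made up for illustration):

```python
import pygal

bar = pygal.Bar(title="Browser usage (illustrative numbers)")
bar.add("Firefox", [30, 32, 28])
bar.add("Chrome", [45, 48, 50])
bar.render_to_file("browsers.svg")   # self-contained, interactive SVG
```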

 

python-igraph

igraph is a collection of network analysis tools with an emphasis on efficiency, portability and ease of use. python-igraph is a Python interface to igraph. Graph plotting functionality is provided by the Cairo library.
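A minimal python-igraph sketch (plotting to a file relies on the Cairo backend mentioned above):

```python
import igraph as ig

g = ig.Graph.Famous("Zachary")     # the classic karate-club network
print(g.vcount(), g.ecount())      # 34 vertices, 78 edges
print(g.degree()[:5])              # degree of the first five vertices
ig.plot(g, "zachary.png")          # rendering goes through Cairo
```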


This is part of a community-edited list at Pansop.

Use the data science search engine (check pre-selected keywords) to find many more resources about Python, visualization or R, applied to data science.

DSC Resources

Additional Reading

Follow us on Twitter: @DataScienceCtrl | @AnalyticBridge

 
Read more…

Originally posted on Data Science Central

OpenStreetMap (OSM) is a tool for creating and sharing map information. Anyone can contribute to OSM; OSM maps are saved on the internet, and anyone can access them at any time. The growth of OSM has been phenomenal over the years, and the number of registered users contributing to OSM is nearly 2 million.

OSM is a powerful open source alternative to Google Maps for your geo analytics projects. In the example below, we take a case study of visualizing car insurance rates on a map. The example car insurance data is available on Vozag.

The end output shows the 10 cheapest car insurance states with a green marker, the 10 most expensive states with a red marker, and the rest of the states in between with a blue marker.

Top 10 cheapest car insurance states in green marker:

  1. Maine : $830

  2. Vermont : $870

  3. Idaho : $970

  4. Iowa : $1030

  5. North Carolina : $1090

  6. Ohio : $1110

  7. New Hampshire : $1110

  8. South Carolina : $1160

  9. Indiana : $1180

  10. Washington : $1230

10 most expensive car insurance states in red marker:

  1. Louisiana : $2700

  2. Michigan : $2520

  3. Georgia : $2160

  4. Oklahoma : $2070

  5. Montana : $1890

  6. California : $1820

  7. West Virginia : $1820

  8. Rhode Island : $1740

  9. Kentucky : $1730

  10. Connecticut : $1720

The data set would need to be in the following format:

  1. State: Louisiana

  2. Car Insurance: $2700

  3. Lat/Long : [30.391830, -92.329102]

The steps to create the OpenStreetMap visualization are simple and are listed below:

  1. Import openlayers.js file from openlayers.org

  2. Create one div tag in the webpage with a specified width and height.

  3. Collect the data for StateLatitudeLongitudeList, StateInsuranceList and the markers, and pass these lists in the sample code format below.

JavaScript sample code to visualize car insurance states on OpenStreetMap

====

<div id="mapdiv" style="width: 800px; height: 400px; margin: 0 auto"> </div>

<script src="http://www.openlayers.org/api/OpenLayers.js"></script>

 <script>

   map = new OpenLayers.Map("mapdiv");

   map.addLayer(new OpenLayers.Layer.OSM());

   epsg4326 =  new OpenLayers.Projection("EPSG:4326"); //WGS 1984 projection

   projectTo = map.getProjectionObject(); //The map projection (Spherical Mercator)

   lat_long_list = LatitudeLongitudeList;

   state_ins_list = StateInsuranceList;

   m = MarkersList

   var lonLat1 = new OpenLayers.LonLat(lat_long_list[1],lat_long_list[0]).transform(epsg4326, projectTo)

   var zoom=7;

   map.setCenter (lonLat1, zoom);

   var vectorLayer = new OpenLayers.Layer.Vector("Overlay");

   

   for(var i=0;i<lat_long_list.length;i++)

   {

     var feature = new OpenLayers.Feature.Vector(

     new OpenLayers.Geometry.Point(a[i][1], a[i][0] ).transform(epsg4326, projectTo),

     {description:state_ins_list[i]} ,{externalGraphic:'/images/'+m[i], graphicHeight: 25,

     graphicWidth: 21, graphicXOffset:-12, graphicYOffset:-25}   

      );                         

   vectorLayer.addFeatures(feature);

   }

   map.addLayer(vectorLayer);

   //Add a selector control to the vectorLayer with popup functions

   var controls = {

     selector: new OpenLayers.Control.SelectFeature(vectorLayer, { onSelect: createPopup, onUnselect: destroyPopup })

   };

   function createPopup(feature) {

     feature.popup = new OpenLayers.Popup.FramedCloud("pop",

         feature.geometry.getBounds().getCenterLonLat(),

         null,

         '<div class="markerContent">'+feature.attributes.description+'</div>',

         null,

         true,

         function() { controls['selector'].unselectAll(); }

     );

     //feature.popup.closeOnMove = true;

     map.addPopup(feature.popup);

   }

   function destroyPopup(feature) {

     feature.popup.destroy();

     feature.popup = null;

   }

   map.addControl(controls['selector']);

   controls['selector'].activate();

 </script>

 =====

DSC Resources

Read more…

Infographics on data quality

Submitted by Kendall Brennan, from Halo BI. Originally posted here.

DSC Resources

Additional Reading

Follow us on Twitter: @DataScienceCtrl | @AnalyticBridge

Read more…

Originally posted on Data Science Central

Guest blog post by Mike Davie.

With the exponential growth of IoT and M2M, data is seeping out of every nook and cranny of our corporate and personal lives. However, harnessing data and turning it into a valuable asset is still in its infancy. In a recent study, IDC estimates that only 5% of the data created is actually analyzed. Thankfully, this is set to change as companies have now found lucrative revenue streams by converting their data into products.

Impediments to Data Monetization

Many companies are unaware of the value of their data, the type of customers who might potentially be interested in those data, and how to go about monetizing the data. To further complicate matters, many also are concerned that the data they possess, if sold, could reveal trade secrets and personalized information of their customers, thus violating personal data protection laws.  

Dashboards and Applications

The most common approach for companies who have embarked on data monetization is to develop a dashboard or application for the data, thinking that it would give them greater control over the data. However, there are several downsides to this approach:

  • Limited customer base
    • The dashboard or application is developed with only one type of customer in mind, thus limiting the potential of the underlying data to reach a wider customer base.
  • Data is non-extractable
    • The data in a dashboard or application cannot be extracted to be mashed up with other data, with which valuable insights and analytics can be developed.
  • Long lead time and high cost to develop
    • Average development time for a dashboard or application is 18 months. Expensive resources including those of data scientists and developers are required.  

Data as a Product

What many companies have failed to realize is that the raw data they possess could be cleansed, sliced and diced to meet the needs of data buyers. Aggregated and anonymized data products have a number of advantages over dashboards and applications.

  • Short lead time and less cost to develop
    • The process of cleaning and slicing data into bite size data products could be done in a 2-3 month time frame without the involvement of data scientists.
  • Wide customer base
    • Many companies and organizations could be interested in your data product.  For example, real time footfall data from a telco could be used in a number of ways:
      • A retailer could use mall foot traffic to determine the best time of the day to launch a new promotion to drive additional sales during off-peak hours.
      • A logistics provider could be combining footfall data with operating expenses to determine the best location for a new distribution centre.
      • A maintenance company could be using footfall to determine where to allocate cleaners to maximize efficiency, while ensuring clean facilities.
  • Data is extractable
    • Data in its original form could be meshed and blended with other data sources to provide unique competitive advantages.  For example:
      • An airline could blend real time weather forecast data with customer profile data to launch a promotion package prior to severe bad weather for those looking to escape for the weekend.
      • Real time ship positioning data could be blended with a port’s equipment operation data to minimize downtime of the equipment and increase overall efficiency of the port.

Monetizing your data does not have to be a painful and drawn-out undertaking if you view the data itself as the product. By taking your data product to market, data itself can become one of your company’s most lucrative and profitable revenue streams. By developing a data monetization plan now, you can reap the rewards of the new Data Economy.

About the Author:

Mike Davie has been leading the commercialization of disruptive mobile technology and ICT infrastructure for a decade with leading global technology firms in Asia, Middle East and North America.

He parlayed his vision and knowledge of the evolution of ICT into the creation of DataStreamX, the world's first online marketplace for real time data. DataStreamX’s powerful platform enables data sellers to stream their data to global buyers across various industries in real time, multiplying their data revenue without having to invest in costly infrastructure and sales teams. DataStreamX's online platform provides a plethora of real time data to data-hungry buyers at their fingertips, enabling them to broaden and deepen their understanding of the industry they compete in, and to devise effective strategies to out-manoeuvre their competitors.

Prior to founding DataStreamX, Mike was a member of the Advanced Mobile Product Strategy Division at Samsung where he developed go-to-market strategies for cutting edge technologies created in the Samsung R&D Labs. He also provided guidance to Asia and Middle East telcos on their 4G/LTE infrastructure data needs and worked closely with them to monetize their M2M and telco analytics data.

Mike has spoken at ICT and Big Data conferences including 4G World, LTE Asia, Infocomm Development of Singapore's IdeaLabs Sessions. Topics of his talks include Monetization of Data Assets, Data-as-a-Service, the Dichotomy of Real-time vs. Static Data.

Read more…

A plethora of big data infographics

Originally posted on Data Science Central

A lot of interesting images can be found on Google. You can search for machine learning cartoons, fake data scientists, Excel maps or any keyword, and get a bunch of interesting images or charts, though the images barely change over time (Google algorithms are very conservative).

Anyway, here's some really interesting stuff. It definitely proves how popular infographics are, and the growth of big data. Many of these infographics are of high quality, well thought out and based on real research.

To see / access all of them at once, click here then select "Image" rather than "Web" search. Or try this link. More infographics can be found here.

 

DSC Resources

Additional Reading

Follow us on Twitter: @DataScienceCtrl | @AnalyticBridge

Read more…

How to Avoid a Data Disaster - Infographic

Infographics provided by SupremeSystems.

We reveal some interesting statistics around data loss and also offer some helpful advice about what an effective data backup plan should look like. For example, did you know that this year, 40% of small to medium businesses that manage their own network and use the Internet for more than e-mail will have their network accessed by a hacker? Also, find out the main causes of data loss, and much more.

Originally posted on Data Science Central

DSC Resources

Additional Reading

Follow us on Twitter: @DataScienceCtrl | @AnalyticBridge

Read more…

For Python:

  • Seaborn - A visualization library based upon matplotlib. Although not interactive, the visualizations can be very nice.
  • Bokeh - Bokeh provides a bit more interaction than Seaborn, but it is still not fully interactive. 

For R

  • htmlwidgets - Allows for tons of interaction and is great for the web.
  • ggplot2 - A very popular plotting system for R. It is widely used and can create just about every type of graph. However, the plots are not interactive. R visualization is a sample application that creates the graph below.

Source for picture: Science.io

For Julia:

  • Gadfly - A Julia library for visualizations. Inspired by ggplot2 for R. It is not really interactive, but it is a great start. See chart produced just with HTML tags.

  • Escher - Beautiful, interactive web UIs in Julia. Escher is rather new, so it is definitely a project to watch. It uses gadfly for graphics.

For more resources on Python, R, Julia, or visualization, try the data science search engine.

DSC Resources

Additional Reading

Follow us on Twitter: @DataScienceCtrl | @AnalyticBridge

Originally posted on Data Science Central

Read more…

Cheat Sheet: Data Visualization with R

RStudio has many interesting cheat sheets about R. Below is just one example. Other cheat sheets about Data Science, Python and R can be found here. Here are additional resources.

Enjoy!

And below is what you can do with the cowplot CRAN package (a ggplot2 add-on) referred to at the beginning of this note:

DSC Resources

Additional Reading

Follow us on Twitter: @DataScienceCtrl | @AnalyticBridge

Originally posted on Data Science Central

Read more…

10 Features all Dashboards Should Have

Here are 10 rules to design great, simple, efficient, adaptive dashboards.

  1. Decide which metrics should be included in your dashboard. Some metrics, such as measuring the impact of sharing and likes on email marketing campaigns, are not easy to estimate. Out-of-dashboard statistical modeling might be best for these sophisticated predictions. Some other metrics are easy to estimate with basic models, and can be integrated into the dashboard, using in-dashboard analytics and data science for predictions. An example that comes to mind is the Google AdWords dashboard: click count forecasts per ad group, based on keywords purchased and on your average CPC.
  2. Provide potential explanations / recommended actions when forecasted numbers are very different from actual numbers. Strangely, Google AdWords predictions (see item #1) are either quite accurate or totally wrong: some new ad groups supposed to produce thousands of clicks (according to Google's dashboard forecasted numbers) are producing less than 5 clicks. We suspect that some factors - maybe impression fraud, or keywords / ads / landing pages blocked by Google, or too little historical data available resulting in poor machine learning performance - make some ad groups fail. I think Google's AdWords dashboard should provide explanations / recommendations (e.g. wait a few more days) when the discrepancy between forecasted and actual traffic volume is huge - there is a lot of potential revenue to be made by Google, should they improve the AdWords dashboard accordingly. Note that this issue mostly impacts new ad groups that have little history.
  3. Become a power user of your dashboard - Outsmart the dashboard architects on how their dashboard can be used. In our case, we designed ISP segments in our VerticalResponse (VR) dashboard to optimize our email marketing campaigns, and created a feature not available on VR. This was the best feature that we managed to get out of our VR dashboard. With Google Analytics, we found a way to get traffic trends by region (as we are trying to increase US traffic, decrease traffic from Asia, and measure our success) despite the fact that no such report is available from the dashboard; another solution is to hire an AdWords expert to extract the data in question. Another problem that we faced is comparing trends for two 2nd-tier traffic sources, say LinkedIn vs. Twitter: because traffic referral charts include all referrals, 2nd-tier sources are dwarfed by the top 3 referrals and the chart is useless; a workaround consists of downloading the data in Excel format (from the dashboard) and creating your own charts.
  4. Prioritize user requests about new reports. In our case, we tried to get click counts as well as the number of uniques clicking regarding our VR campaigns, to identify outlier users who generate 1,000+ clicks from a single account. The purpose was to identify fake traffic generated by a bot, or very popular users sharing our newsletter, generating thousands of clicks that can be attributed to a few popular subscribers. This feature (tracking the number of uniques clicking) was not available on VR (unless you download very granular reports - something business analysts are not good at), but discussing this issue with VR (and the fact that their competitors such as iContact offer this feature on their dashboard) helps improve the VR dashboard.
  5. Decide on the optimum refresh rate for each report - Some users like real-time (I do, because it helps me detect right away whether a marketing campaign or blog post is going to work or not). But if producing real-time reports is very expensive, maybe end-of-day is good enough. In my case, I enjoy Google Analytics real-time reports, but it's like a nice perk, and I can do equally well (even better, by not wasting time being stuck on true real-time stats) by looking at daily or weekly stats. Real time offers some value, such as about which blogs I need to promote, but is it worth the price? In the case of Google Analytics, the answer is yes because it is free, though (as an executive) I feel I'm wasting too much time on these reports, given the relative value that they provide.

  6. Try to identify great, buried reports available on your dashboard - In our case, the list of IP addresses of new members, provided by our Ning dashboard, has proved to be a very detailed, powerful, yet buried report that helps us eliminate spammers and scammers signing up on DSC. It also involves merging data from external vendors such as StopForumSpam or SpamHaus. This brings a new rule: build a meta-dashboard that blends both internal and external data, rather than working with isolated (silo) dashboards.
  7. Customize reports, priority and user access - Allow users to design their own reports; display reports that are associated with top KPIs and related to 50% or more of the revenue (nobody cares about the bottom ten IP addresses visiting your website). Give access to reports based on needs and user security clearance.

  8. Create a centralized report center - Merge silos, have a centralized dashboard that accesses data from various sources, both internal and external. In our case, Google AdWords, AdSense and Google Analytics are three separate dashboards that do not communicate. Fix issues like this one.
  9. Send email alerts to clients (internal or external) - Allow clients to choose which reports they want to receive, as well as frequency (daily, weekly, or based on urgency). Prioritize your email alerts depending on recipient: urgent, high priority or not urgent. Train users to create specific email folders for your email alerts. Do A/B testing to see what kind of alarm system (frequency, type of reports) is most useful to your company.

  10. Create actionable dashboards offering automated actions - For instance, we'd like to have our most popular tweets automatically advertised on our Twitter advertising account. This would be a win-win both for us and Twitter, but currently the process is still very manual, because the Twitter advertising dashboard does not seem to provide a solution. Maybe it does, but if users can't find the magic button, it is useless. Training users (provide online help, but also offer AND read user feedback) is a great way to make your dashboard successful.
  11. Fast retrieval of information. One-time reports that are created on the fly by a user (as opposed to reports automatically populated every day) should return results very fast. I've seen Brio (a browser-based dashboard to create SQL queries) take 30 minutes to return less than a megabyte of data, even though there was no SQL join involved: when this happens, discuss with your sys admin, or use another tool (in my case, I trained the business analyst to directly write queries on the Oracle server, via a template Perl script accepting the SQL query as input, and returning the data as a tab-separated text file).

Related articles

Originally posted on Data Science Central

Read more…

Guest blog post by Roopam Upadhyay, from YOU CANalytics 

Killer Dashboards

Some time ago I came across an Excel-based dashboard with around 45 tabs, and every tab had around 10 charts / graphs / tables. That is roughly 450 figures and tables. I could only sympathize with the end audience of that dashboard. Unfortunately, almost every organization is deluged with meaningless reports and dashboards of this kind. I immediately knew I had to write this article on YOU CANalytics to save end users from the torture these reports can cause. Luckily, nobody is reading these reports, and for good reason.

The problem is not with Excel-based dashboards but with the clouded thinking on the part of the designers and analysts. Trust me, no fancy software (Tableau, Business Objects or QlikView) can rectify the problem of a muddled brain. Hence, let me suggest a few strategies for creating winning reports and dashboards. The strategies have been framed as the titles below.

  1. It’s the Question, Stupid!
  2. Know Your Frazzled Audience
  3. Learn to Disagree if Required
  4. Tell a Good Story
  5. Edit, Edit, & Edit More
  6. Aesthetic Minimalism
  7. Actionable Insights – Predictive Analytics in Action!

Before I tell you more about the above strategies, let me share with you a piece of memory from my childhood that will serve us well in this journey.

The Mahatma

Let us go some 25 years back in time, when India was not bombarded 24/7 with more than 500 channels on television. Then it was just one channel – Doordarshan – with less than 8 hours of broadcasting time in a day. Television was a lot calmer then. Doordarshan used to have a short clip about Mahatma Gandhi as filler. The clip was not more than a couple of minutes long, with brush strokes leaving on a white canvas a doodle similar to the one shown adjacent. In the background a man’s calm voice recites the following – The greatness of this man was his simplicity. Let’s try and discover the Gandhi in ourselves. That is when I learned that one could depict the Mahatma with just a couple of brush strokes. I believe we need to discover simplicity within us and around us to create relevant and winning dashboards. The following are my 7 tips for the same.

1)  It’s the Question, Stupid!

“[It's] the economy, stupid” was the slogan Bill Clinton effectively used in the US presidential election in 1992 against George H. W. Bush (Dubya Bush’s father). Bush’s campaign was focusing on the Iraq war and Saddam, while Clinton hit the bull’s eye by focusing on the right issue and won the election by a huge margin. I will modify the phrase a bit for creating winning reports by focusing on the single most important factor – “it’s the question, stupid”.

For me, charts, graphs, analysis or even data come much later. I think the foremost thing of importance is the question these tools are trying to answer. Every dashboard, report, presentation, analysis or model has a business question which it tries to answer. I would recommend that you write down the question(s) in plain English before starting the analysis, without assuming that everybody knows the question(s). After that, everything is just an attempt to answer the question using the best possible medium / tools.

2)  Know Your Frazzled Audience

Okay, if you don’t know them already, let me introduce the end users of your dashboard, analysis or presentation to you. They are overloaded with information and a plethora of pending issues. To make things worse, they are perpetually distracted by the emails on their smartphones. They are looking for simplicity but find it difficult to think straight within the chaos. Your analysis, report or dashboard could bring some simplicity into their work life, but they will find it difficult to see it at first. This will make your job really tough. But you cannot afford to get flustered by the surroundings, otherwise you will just add yet another useless analysis or dashboard.

3)  Learn to Disagree if Required

Now that you know your end audience better, you will also appreciate that they may at times ask questions that reflect their confusion during the analysis phase. This is when you need to guide them, question their thought process and if you disagree, say no to their muddled ideas. Be polite but firm while doing so.

4)  Tell a Good Story

We all love stories. Presentations, reports and dashboards are just visual stories. The opening chapter of this story is of course the question(s). Once the questions are framed clearly, a good practice is to think of answering the questions like a story. The graphs, tables and charts are different characters in this story, and they will help in moving the story forward. A story is nothing but the logical progression of thought. For instance, a guest appearance by Jabba the Hutt in a Shakespearean tragedy is unacceptable; similarly, an unwanted chart or graph is enough to confuse your audience and make them disengage. Use your characters wisely and arrange them in the logical order of the story.

5)  Edit, Edit, & Edit More

Great movies cannot be made without great editing. You need to be completely dispassionate and ruthless while editing your reports and dashboards. Remember, your audience is not looking for a 3+ hour ‘Schindler’s List’ but a succinct trailer. This is your last chance to remove those unwanted characters (figures and tables). The hardest part is when you have to cut the role of your favorite actor – the analysis you have spent several sleepless nights working on, but which does not fit well with your story. Keep your scissors sharp, as you will need them.

6)  Aesthetic Minimalism

Mark Twain, one of the greatest writers once said: If you want me to give you a two-hour presentation, I am ready today.  If you want only a five-minute speech, it will take me two weeks to prepare. The idea captures the gist of one of the essential principles of designing dashboards and reports – less is often more!

I am a huge fan of Japanese minimalist art and design. The primary philosophy behind minimalism is to say more with fewer elements – like the depiction of Mahatma Gandhi displayed above, or the adjacent picture. Similarly, I believe that for dashboards the graphics and charts should be simple and to the point. Simplicity goes a long way when it comes to designing dashboards. Though trust me, this is a really difficult and effort-intensive path.

7)  Actionable Insights – Predictive Analytics in Action!

The following is a case from the Emergency Room (ER) of Cook County Hospital in Chicago from the mid-1990s. Being a public hospital, it serves poor people without health insurance. The ER gets a quarter million emergency patients per year – this ensures a perpetual chaos-like situation for the doctors and medical staff to handle. The excessive inpatient rate also guarantees a consistent shortage of hospital beds. A patient walking in with chest pain could be a case of impending heart attack or (not so dangerous) indigestion. The first case needs serious medical attention, and the second a bit of reassurance and a bus ticket back home. The problem statement for the doctors and medical staff at Cook County Hospital was simple: distinguish severe cases from harmless cases fast and upfront, to save as many lives as possible under the constraint of limited resources.

They were not looking for a fancy dashboard or sleek reports but a simple rule to help them make this decision fast. A competent bunch of statisticians (physicists actually) helped the hospital solve this problem. A predictive model suggested checking the following 3 ‘urgent risk factors’ along with the ECG report:

  1. Is the pain felt by the patient unstable angina?* (Answer: Yes / No)
  2. Is there fluid in the patient’s lungs? (Answer: Yes / No)
  3. Is the patient’s systolic blood pressure below 100? (Answer: Yes / No)

What transpired was a simple decision tree that anyone could use to direct the patient to the appropriate treatment. Insights that facilitate action, or actionable insights, are exactly what decision makers are looking for. Any dashboard, report or analysis that delivers this will never go unnoticed and is a winner.
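To make the idea concrete, here is a purely illustrative Python sketch of such a triage-style decision tree. It is NOT the actual clinical rule used at Cook County Hospital (the exact combination logic is not given here); it only shows how an ECG reading plus a few yes/no risk factors can drive a simple, fast decision:

```python
# Illustrative only: a hypothetical triage rule, not the real clinical algorithm.
def triage(ecg_abnormal: bool, unstable_angina: bool,
           fluid_in_lungs: bool, systolic_below_100: bool) -> str:
    risk_factors = sum([unstable_angina, fluid_in_lungs, systolic_below_100])
    if ecg_abnormal or risk_factors >= 2:
        return "admit to coronary care"        # severe case: act immediately
    elif risk_factors == 1:
        return "observe in intermediate unit"  # borderline case
    else:
        return "reassure and send home"        # harmless case

print(triage(ecg_abnormal=False, unstable_angina=False,
             fluid_in_lungs=False, systolic_below_100=False))
```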

* This case was published in Malcolm Gladwell’s best selling book ‘Blink’ and the factors are directly taken from the book.

Sign-off Note

I have done my bit by sharing the tips I feel will make people take dashboards more seriously. I believe the business metrics and KPIs for your dashboard are going to pop out naturally once you follow the above steps. Let me know if you have suggestions for the above list; post your comments right below.

Originally posted on Data Science Central

Read more…


Another cute graph

Originally posted on Analytic Bridge

Not sure how this firework graph was produced. It "shows" 10 million emails sent through the Yahoo! Mail service in 2012; a team of researchers used the R language to create a map of countries whose citizens email each other most frequently. Click here for another interesting chart on Analyticbridge (produced with Tableau). Read this article to find out more about the algorithms used to produce these maps. And here are 14 questions about visualization tools.

Question: do you think that these graphs are

  • beautiful but useless?
  • beautiful and useful?
  • not beautiful, and useless?
  • not beautiful, but useful?

Related articles:


A normalized map of e-mail density between countries, where closer proximity indicates more e-mail. The colors correspond to Huntington’s “civilizations.” (Bogdan State et al)

The Internet was supposed to let us bridge continents and cultures like never before. But after analyzing more than 10 million e-mails from Yahoo! mail, a team of computer researchers noticed an interesting phenomenon: E-mails tend to flow much more frequently between countries with certain economic and cultural similarities.

Among the factors that matter are GDP, trade, language, non-Commonwealth colonial relations, and a couple of academic-sounding cultural metrics, like power-distance, individualism, masculinity and uncertainty. (More on those later.)

The findings were released in a paper titled “The Mesh of Civilizations and International Email Flows,” written by researchers at Stanford, Cornell, Yahoo! and Qatar’s Computational Research Institute.

Read the full article.

Read more…

Great Machine Learning Infographics

Originally posted by Shivon Zillis (Bloomberg Beta investor) at Shivonzillis.com.

DSC Resources

Additional Reading

Follow us on Twitter: @DataScienceCtrl | @AnalyticBridge

Read more…
