Great Machine Learning Infographics

Originally posted by Shivon Zilis (Bloomberg Beta investor) at

DSC Resources

Additional Reading

Follow us on Twitter: @DataScienceCtrl | @AnalyticBridge

Read more…

30 Basic Tools For Data Visualization

Posted on FastCodeDesign.


iCharts

iCharts is a platform that connects publishers of market research, economic, and industry data with professional consumers. It hosts tens of thousands of charts in business, economy, sports, and other categories, making it simple to discover and follow the world's latest data insights. iCharts provides a patented, cloud-based charting tool that enables companies and individuals to brand, market, and share their data as chart content with millions of viewers across the web. Free accounts let you create basic interactive charts, while a premium version adds many more features. Charts can include interactive elements and can pull data from Google Docs, Excel spreadsheets, and other sources. [Link]

Click here to view similar graphics.


FusionCharts

FusionCharts Suite XT is a professional, premium JavaScript charting library that lets you create virtually any type of chart. It uses SVG and supports 90+ chart types, including 3-D charts, Gantt charts, funnels, various gauges, and even maps of the world, continents, countries, and states. Most charts come in both 2-D and 3-D versions. Charts are fully customizable: labels, fonts, colors, borders, and more can all be changed. They are also heavily interactive, with tooltips, clickable legend keys, drill-down, zooming/scrolling, and one-click chart export or print. [Link]

Modest Maps

Modest Maps is a small, extensible, and free library for designers and developers who want to use interactive maps in their own projects. It provides a core set of features in a tight, clean package with plenty of hooks for additional functionality. [Link]

View full list. (30 tools)

Related links

Originally posted on Data Science Central

Read more…

Unstructured Data: InfoGraphics

Originally posted on Big Data News


Submitted by Ronan Keane.


Read more…

Data Science Wars: R versus Python

Nice infographic by DataCamp. Click here to view the original, commented version.


Read more…

Guest blog post by Vincent Granville

This is a follow-up to our article The fastest growing data science / big data profiles on Twitter. Our intern Livan has now developed an app that displays the number of new followers, and the total number of followers, for the most popular data science / big data Twitter profiles over any time period, allowing you to detect trends with the naked eye. Click here to check out the app and play with it (move the various cursors on the left-hand side to make the corresponding chart appear on your screen). Below is a screenshot. Click here to see a sample of the Python code for this Twitter API. Still in beta mode!
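
The computation behind such an app is straightforward: given cumulative follower counts sampled over time, new followers per period are just successive differences. A minimal sketch (the dates and counts below are made up for illustration; this is not the app's actual code):

```python
# Derive "new followers per period" from cumulative follower snapshots.
# The dates and counts are hypothetical illustration data, not real figures.

def new_followers(snapshots):
    """snapshots: list of (period, total_followers), sorted by period.
    Returns a list of (period, followers_gained_since_previous_snapshot)."""
    return [(d1, t1 - t0)
            for (d0, t0), (d1, t1) in zip(snapshots, snapshots[1:])]

snapshots = [("2014-01", 1000), ("2014-02", 1150), ("2014-03", 1400)]
print(new_followers(snapshots))  # [('2014-02', 150), ('2014-03', 250)]
```

The same differencing works for any time granularity the app's cursors select; charting the resulting series is then a plain line plot.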

Related article

Read more…

Guest blog post by Data Science Girl

Fantastic resource created by Andrea Motosi. I've only included the 5 categories that are most relevant to our audience, though it has 31 categories in total, including a few on distributed systems and Hadoop. Click here to view all 31 categories. You might also want to check out our internal resources (the first section below).

Source: Machine Learning and Face Recognition Papers

Data Science Central - Resources

Machine Learning

  • Apache Mahout: machine learning library for Hadoop
  • Ayasdi Core: tool for topological data analysis
  • brain: Neural networks in JavaScript
  • Cloudera Oryx: real-time large-scale machine learning
  • Concurrent Pattern: machine learning library for Cascading
  • convnetjs: Deep Learning in Javascript. Train Convolutional Neural Networks (or ordinary ones) in your browser
  • Decider: Flexible and Extensible Machine Learning in Ruby
  • etcML: text classification with machine learning
  • Etsy Conjecture: scalable Machine Learning in Scalding
  • Google Sibyl: System for Large Scale Machine Learning at Google
  • H2O: statistical, machine learning and math runtime for Hadoop
  • IBM Watson: cognitive computing system
  • MLbase: distributed machine learning libraries for the BDAS stack
  • MLPNeuralNet: Fast multilayer perceptron neural network library for iOS and Mac OS X
  • nupic: Numenta Platform for Intelligent Computing: a brain-inspired machine intelligence platform, and biologically accurate neural network based on cortical learning algorithms
  • PredictionIO: machine learning server built on Hadoop, Mahout and Cascading
  • scikit-learn: scikit-learn: machine learning in Python
  • Spark MLlib: a Spark implementation of some common machine learning (ML) functionality
  • Sparkling Water: combine H2O's Machine Learning capabilities with the power of the Spark platform
  • Vahara: Machine learning and natural language processing with Apache Pig
  • Viv: global platform that enables developers to plug into and create an intelligent, conversational interface to anything
  • Vowpal Wabbit: learning system sponsored by Microsoft and Yahoo!
  • WEKA: suite of machine learning software
  • Wit: Natural Language for the Internet of Things
  • Wolfram Alpha: computational knowledge engine


Data Visualization

  • Arbor: graph visualization library using web workers and jQuery
  • CartoDB: open-source or freemium hosting for geospatial databases with powerful front-end editing capabilities and a robust API
  • Chart.js: open source HTML5 Charts visualizations
  • Crossfilter: JavaScript library for exploring large multivariate datasets in the browser. Works well with dc.js and d3.js
  • Cubism: JavaScript library for time series visualization
  • Cytoscape: JavaScript library for visualizing complex networks
  • D3: JavaScript library for manipulating documents
  • DC.js: Dimensional charting built to work natively with crossfilter rendered using d3.js. Excellent for connecting charts/additional metadata to hover events in D3
  • Envisionjs: dynamic HTML5 visualization
  • Freeboard: open source real-time dashboard builder for IoT and other web mashups
  • Gephi: An award-winning open-source platform for visualizing and manipulating large graphs and network connections
  • Google Charts: simple charting API
  • Grafana: graphite dashboard frontend, editor and graph composer
  • Graphite: scalable Realtime Graphing
  • Highcharts: simple and flexible charting API
  • IPython: provides a rich architecture for interactive computing
  • Keylines: toolkit for visualizing the networks in your data
  • Matplotlib: plotting with Python
  • NVD3: chart components for d3.js
  • Peity: Progressive SVG bar, line and pie charts
  • Plotly: Easy-to-use web service that allows for rapid creation of complex charts, from heatmaps to histograms. Upload data to create and style charts with Plotly's online spreadsheet. Fork others' plots.
  • Recline: simple but powerful library for building data applications in pure Javascript and HTML
  • Redash: open-source platform to query and visualize data
  • Sigma.js: JavaScript library dedicated to graph drawing
  • Vega: a visualization grammar

Graph Databases

  • Apache Giraph: implementation of Pregel, based on Hadoop
  • Apache Spark Bagel: implementation of Pregel, part of Spark
  • ArangoDB: multi-model distributed database
  • Facebook TAO: TAO is the distributed data store that is widely used at Facebook to store and serve the social graph
  • Faunus: Hadoop-based graph analytics engine for analyzing graphs represented across a multi-machine compute cluster
  • Google Cayley: open-source graph database
  • Google Pregel: graph processing framework
  • GraphLab PowerGraph: a core C++ GraphLab API and a collection of high-performance machine learning and data mining toolkits built on top of the GraphLab API
  • GraphX: resilient Distributed Graph System on Spark
  • Gremlin: graph traversal Language
  • InfiniteGraph: distributed graph database
  • Infovore: RDF-centric Map/Reduce framework
  • Intel GraphBuilder: tools to construct large-scale graphs on top of Hadoop
  • MapGraph: Massively Parallel Graph processing on GPUs
  • Neo4j: graph database written entirely in Java
  • OrientDB: document and graph database
  • Phoebus: framework for large scale graph processing
  • Sparksee: scalable high-performance graph database
  • Titan: distributed graph database, built over Cassandra
  • Twitter FlockDB: distributed graph database


SQL Databases

  • Actian Ingres: commercially supported, open-source SQL relational database management system
  • BayesDB: statistics-oriented SQL database
  • Cockroach: Scalable, Geo-Replicated, Transactional Datastore
  • Datomic: distributed database designed to enable scalable, flexible and intelligent applications
  • FoundationDB: distributed database, inspired by F1
  • Google F1: distributed SQL database built on Spanner
  • Google Spanner: globally distributed semi-relational database
  • H-Store: experimental main-memory, parallel database management system optimized for on-line transaction processing (OLTP) applications
  • HandlerSocket: NoSQL plugin for MySQL/MariaDB
  • IBM DB2: object-relational database management system
  • InfiniSQL: infinitely scalable RDBMS
  • MemSQL: in-memory SQL database with optimized columnar storage on flash
  • NuoDB: SQL/ACID compliant distributed database
  • Oracle Database: object-relational database management system
  • Oracle TimesTen in-Memory Database: in-memory, relational database management system with persistence and recoverability
  • Pivotal GemFire XD: Low-latency, in-memory, distributed SQL data store. Provides SQL interface to in-memory table data, persistable in HDFS
  • SAP HANA: in-memory, column-oriented, relational database management system
  • SenseiDB: distributed, realtime, semi-structured database
  • Sky: database used for flexible, high performance analysis of behavioral data
  • SymmetricDS: open source software for both file and database synchronization
  • Teradata Database: complete relational database management system
  • VoltDB: in-memory NewSQL database


Related articles (Internal to DataScienceCentral)

Read more…

Guest blog post by Vincent Granville

Benzene (C6H6) was the last fundamental chemical compound to have had its atomic structure uncovered. This discovery paved the way to

  • Creating large numbers of complex synthetic molecules using computer simulations and deep mathematics (which in turn led to the explosion of synthetic drugs in big pharma)
  • Creating very useful but weird carbon molecules using computer simulations and deep mathematics (which in turn led to the explosion of nanotechnologies)

Big innovation in combinatorial chemistry today is rooted in the discovery and understanding of complex, unusual, bizarre atomic structures such as Benzene, and the application of advanced analytic principles.

Let's start with a scratch course in chemistry (section 1). Then I'll explain how analytics helps create these incredibly powerful and useful new technologies (section 2).

1. Basic Chemistry Tutorial

Molecules are made of atoms. Atoms are the "prime number" entities that generate all the molecules. There are about 100 types of atoms in our universe, ordered by their atomic number: for a comprehensive list, check out the periodic table of elements. Molecules that contain at least two different types of atoms are called compounds.

Examples of atoms include H (Hydrogen), C (Carbon), O (Oxygen), Cl (Chlorine), and Na (Sodium). Examples of compounds include C6H6 (benzene), H2O (water), NaCl (salt), and CH4 (methane).

With the 100 or so fundamental atoms, one can create an infinite number of molecules, and a finite number (say f(n)) of molecules with exactly n atoms. Note that a benzene molecule has 12 atoms (6 carbon + 6 hydrogen), water has 3, salt has 2, and methane has 5. Also, the function f(n) grows incredibly fast, much faster than exponentially. The possibilities for new molecule creations (that is, combinations of atoms) are endless. The term combinatorial chemistry has been used in this context.

Organic chemistry is about molecules that contain H, O, and C (in any quantity) and no other atoms. Not all combinations are permitted or stable; for instance, HO does not exist, while H2O does.


Whether a molecule can exist is determined by the number of electrons on the outermost layer of each of its atoms, and by whether bonds can be created so that each atom (ideally) ends up with the equivalent of 8 electrons on its outermost layer, by sharing electrons with adjacent atoms within the molecule.

Bonds can be single, double, or triple.

  • Hydrogen: has 1 electron on its outermost layer; forms a single bond with one other atom
  • Oxygen: 6 electrons on its outermost layer; needs two more, so it forms two bonds (for example, one double bond or two single bonds)
  • Carbon: 4 electrons on its outermost layer; needs four more, so it forms up to four bonds

Examples of bonding: water (H2O) at the top, salt (NaCl) at the bottom 

Water: Oxygen forms a single covalent bond with each of the two Hydrogen atoms. Salt: Sodium lets Chlorine use its valence electron.

When Na (Sodium) and Cl (Chlorine) bond together to form NaCl (salt), the isolated Na electron on the outermost layer bonds to the outermost layer of the Cl atom, and equilibrium (8 electrons on the outermost layer) is reached. The same holds for water.
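
The bonding rules above can be turned into a quick mechanical plausibility check. The sketch below is a simplification, not a real chemistry engine: it encodes only the valences used in this section (H=1, O=2, C=4, Na=1, Cl=1) and two necessary conditions, namely that bond units come in pairs (each bond consumes one from each atom) and that the atoms must be connectable. It correctly rejects HO and accepts H2O:

```python
# Quick plausibility filter for a molecular formula, using only the
# simplified valence model from this post (not a real chemistry engine).
VALENCE = {"H": 1, "O": 2, "C": 4, "Na": 1, "Cl": 1}

def plausible(atoms):
    """atoms: list of element symbols, e.g. ["H", "H", "O"] for H2O.
    True if a connected bond graph could satisfy every valence."""
    v = [VALENCE[a] for a in atoms]
    total = sum(v)
    if len(v) < 2 or total % 2 != 0:  # every bond consumes 2 valence units
        return False
    if total < 2 * (len(v) - 1):      # too few bond units to connect all atoms
        return False
    return max(v) <= total - max(v)   # no atom demands more bonds than the rest offer

print(plausible(["H", "O"]))             # False: HO does not exist
print(plausible(["H", "H", "O"]))        # True: water
print(plausible(["C"] * 6 + ["H"] * 6))  # True: benzene
```

These are necessary conditions only; passing the filter does not guarantee the molecule is chemically stable.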

2. The Analytic Path to Innovation

Now let's discuss the two applications introduced earlier.

A test for data science applicants

First, stop reading, and answer the following question: how can benzene be represented, given its formula C6H6 and the bonding constraints described above? Everything you need to answer this question is explained above. It is indeed a difficult question, the kind that Google would love to ask future hires, and it took decades before a solution was eventually found. (The answer is in the picture at the bottom; a line segment represents a single bond, and a double line segment represents a double bond.)
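
Once you know the answer (a ring of six carbons with alternating single and double bonds, plus one hydrogen attached to each carbon), verifying it mechanically is easy. A small sketch that checks every atom's valence is satisfied:

```python
# Verify the benzene structure: six carbons in a ring with alternating
# double/single bonds, one hydrogen per carbon. Every carbon should use
# exactly 4 bond units and every hydrogen exactly 1.
bonds = []
for i in range(6):
    bonds.append((f"C{i}", f"C{(i + 1) % 6}", 2 if i % 2 == 0 else 1))
    bonds.append((f"C{i}", f"H{i}", 1))

used = {}
for a, b, order in bonds:
    used[a] = used.get(a, 0) + order
    used[b] = used.get(b, 0) + order

print(all(used[f"C{i}"] == 4 for i in range(6)))  # True
print(all(used[f"H{i}"] == 1 for i in range(6)))  # True
```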

2-D organic molecules

Drug companies were among the first to realize that it makes sense to create a catalog of all potential molecules with at most n atoms, each atom being H, C, or O. These millions of potential molecules can easily be clustered (using statistical clustering techniques), and their medicinal properties can be guessed even before the first one is manufactured, or even if it can't be manufactured. Currently, about 100,000 such molecules are created each year with the help of computer simulation.
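
A toy version of such a catalog can be sketched in a few lines: enumerate every formula CxHyOz with at most n atoms and keep those that pass the simple valence-feasibility test from section 1. Real software builds actual structures, not just formulas, so this is only an illustration:

```python
# Toy catalog: all formulas C_x H_y O_z with at most n atoms that pass the
# simplified valence-feasibility test from section 1 (H=1, O=2, C=4).
from itertools import product

def feasible(c, h, o):
    valences = [4] * c + [1] * h + [2] * o
    n, total = len(valences), sum(valences)
    if n < 2 or total % 2 != 0:       # bond units must pair up
        return False
    if total < 2 * (n - 1):           # too few bond units to connect n atoms
        return False
    return max(valences) <= total - max(valences)

def catalog(max_atoms):
    return [(c, h, o)
            for c, h, o in product(range(max_atoms + 1), repeat=3)
            if c + h + o <= max_atoms and feasible(c, h, o)]

cands = catalog(5)
print((1, 4, 0) in cands)  # True: CH4 (methane)
print((0, 2, 1) in cands)  # True: H2O (water)
print((0, 1, 1) in cands)  # False: HO
```

Clustering these candidates by composition is then a standard statistical exercise on the (c, h, o) feature vectors.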

3-D carbon molecules

What if, instead of using H, O, and C, you use just C alone? Can you create C2, C3, C4, and so forth? Which C's can you create, and which ones cannot exist?

It turns out that this is a very difficult question. Instead of creating planar molecules, you must consider molecules with a 3-dimensional atomic structure. The first to be discovered was buckminsterfullerene (C60), in 1985; its atomic structure (a sphere) looks like a soccer ball. I'm not sure it has any practical value, but it led to the discovery of a famous class of cylindrical carbon molecules called nanotubes, the most notorious being C74. These molecules are about to create a new industrial revolution, with the creation of incredibly strong, light (one atom thick!) cables with remarkable thermal and electrical properties.


Buckminsterfullerene (C60)

Nanotubes (C74)

Benzene (C6H6)

Related articles:

Read more…

4 Business Benefits of Data Visualization

Guest blog post by Sreeram Sreenivasan

Each year, the amount of data being created is increasing tremendously. Due to the massive availability of data, businesses are becoming more & more number-driven in their decision making.

However, all this data that’s being created by people & devices doesn’t provide executives and other decision makers with valuable insights on its own. It must be organized and analyzed to get any meaningful value. To discover new business opportunities, business leaders & executives need to be able to analyze and interpret data in real-time.

Data visualization techniques & tools offer key stakeholders and other decision makers the ability to quickly grasp the information hiding in their data.

Here are the top four benefits that data visualization provides to organizations & decision makers.

1. Absorb more information easily

Data visualization enables users to view & understand vast amounts of information regarding operational and business conditions. It allows decision makers to see connections between multi-dimensional data sets and provides new ways to interpret data through heat maps, bullet charts, and other rich graphical representations. Businesses that use visual data analytics are more likely to find the information they are looking for, and to find it sooner, than other companies. A survey by Aberdeen Group found that managers in organizations that use visual data discovery tools are 28 percent more likely to find relevant information than those who rely only on managed dashboards & reporting.

2. Discover relationships & patterns between business & operational activities

Data visualization enables users to effectively see patterns and relations that occur between operations and business performance. It’s easier to see how your day-to-day tasks impact your overall business performance, and find out which operational change triggered the growth/dip in business performance.

Data visualization allows you to see historical trends in the key performance metrics, like monthly sales, of your business. This allows you to compare the current performance with the past, and forecast the future. You can even break it out by various components that drive the metric, such as the sales by various sources or regions. Further, you can drill down to see their historical trends and contributions.

This allows the executives to identify the reason for growth and repeat it, or find the root cause of the dip and fix it.

3. Identify emerging trends faster

The amount of customer & market data that companies are able to gather can provide key insights into new business & revenue opportunities. To avoid getting lost in the mountain of data, it needs to be simplified. That’s where data visualization comes in. Data visualization enables decision makers to spot changes in customer behavior and market conditions more quickly. For example, a supermarket chain can use data visualization to see that not only do customers spend more as macro-economic conditions improve, but they increasingly purchase ready-made foods.

4. Directly interact with data

Data visualization enables you to bring actionable insights to the surface. Unlike static tables and charts, which can only be viewed, data visualization tools enable users to interact with data. For example, a canned report or Excel spreadsheet can inform an executive at an automotive company that sales of its sedans are down for a particular month. However, it won't tell her why. Using real-time data visualization, the executive can view the latest sales figures, see which models of sedans are underperforming, and identify the reason for the drop in sales (discounts offered by competitors). The executive can then respond quickly; for example, she might launch a 15-day sales promotion for the specific dealerships where sales numbers have dipped.

Empowering executives with data visualization can provide new ways of looking at business strategy & operations, and enable senior management to drive business performance better.

About the author:

Sreeram is the founder of Ubiq, a web-based Business Intelligence & Reporting Application.

Read more…

Guest blog post by Vincent Granville

There have been a number of interesting articles recently discussing the skills a data scientist should or might have. The one entitled The 22 Skills of a Data Scientist is a popular one (see the 22 skills listed below, or click on the link to read the full article). Earlier this morning, I read another one on LinkedIn: Data Scientist – MUST have skills?. The picture below comes from that LinkedIn article. Some of these articles have been posted on our network by external bloggers, for instance skills you need to become a data scientist or Some software and skills that every Data Scientist should know. Popular ones include how to become a data scientist and Are You A Data Scientist?


I tend to have some level of disagreement with many of these authors. My disagreement can be summarized as follows:

  • Rather than defining data scientists by a bunch of skills that few employees possess (though many analytic executives possess all of them and more), it makes more sense to divide data scientists into multiple categories: data engineers, machine learning experts, modelers, business-oriented data scientists, researchers, domain experts, generalists, etc., each possessing a separate skill set. Google "six categories of data scientists" for details.
  • Also, you can train data scientists to have all the required skills. Colleges do a poor job at this, focusing instead on delivering siloed, outdated curricula and being out of touch with the real world. Some modern 6-month training programs teach the foundations to self-learners; that's the purpose of our free data science apprenticeship, which uses a project-based approach (real-life projects), though there are other alternatives.

The 22 skills in question 

Would you add anything to, or remove anything from, this great list created by Matt Reany? First, I'd categorize these skills. Then I would certainly add business acumen, domain expertise, hacking skills, presentation and listening skills, good judgment, not trusting models blindly, the ability to work in a team or with clients, familiarity with all sorts of databases and file management systems, some data engineering, some data architecture and dashboard design, data detection, real-time analytics, data vendor expertise (vendor selection, benchmarking), and being the metrics expert in your company (even deciding which metrics to track and how to collect the data).

  • Algorithms (ex: computational complexity, CS theory) DD,DR
  • Back-End Programming (ex: JAVA/Rails/Objective C) DC, DD
  • Bayesian/Monte-Carlo Statistics (ex: MCMC, BUGS) DD, DR
  • Big and Distributed Data (ex: Hadoop, Map/Reduce) DB, DC, DD
  • Business (ex: management, business development, budgeting) DB
  • Classical Statistics (ex: general linear model, ANOVA) DB, DC, DR
  • Data Manipulation (ex: regexes, R, SAS, web scraping) DC, DR
  • Front-End Programming (ex: JavaScript, HTML, CSS) DC, DD
  • Graphical Models (ex: social networks, Bayes networks) DD, DR
  • Machine Learning (ex: decision trees, neural nets, SVM, clustering) DC, DD
  • Math (ex: linear algebra, real analysis, calculus) DD,DR
  • Optimization (ex: linear, integer, convex, global) DD, DR
  • Product Development (ex: design, project management) DB
  • Science (ex: experimental design, technical writing/publishing) DC, DR
  • Simulation (ex: discrete, agent-based, continuous) DD,DR
  • Spatial Statistics (ex: geographic covariates, GIS) DC, DR
  • Structured Data (ex: SQL, JSON, XML) DC, DD
  • Surveys and Marketing (ex: multinomial modeling) DC, DR
  • Systems Administration (ex: *nix, DBA, cloud tech.) DC, DD
  • Temporal Statistics (ex: forecasting, time-series analysis) DC, DR
  • Unstructured Data (ex: noSQL, text mining) DC, DD
  • Visualisation (ex: statistical graphics, mapping, web-based data‐viz) DC, DR

Related articles

Read more…

How can organizations use data visualization, visual analytics, and visual data discovery to improve decision-making, collaboration, and operational execution? We present three key insights from the latest TDWI research.

Guest blog by David Stodder, TDWI Research. Article originally published here.

Data visualization is one of the innovations of our time. From the moment most of us wake up in the morning and fire up our tablets, smartphones, and laptops, visual representations of data fill our lives. Developments in stock markets, sports, and science, for example, are increasingly told through data visualization. We encounter beautifully rendered "infographics" that explain trends and patterns in data. News organizations such as The New York Times compete on analytics by serving up infographics to shed light on aspects of news stories that would otherwise be buried in text. Such infographics are shared widely in blogs and social media, turning what might otherwise have been obscure data findings into the day's biggest buzz.

Thus, although organizations need to be mindful of business users who have visual impairments, it is clear that their investments in data visualization libraries, tools, and applications -- and the professionals who know how to implement them -- are worthwhile. Visualization is often the best and most persuasive way of communicating quantitative insights. “We acquire more information through vision than through all of the other senses combined,” wrote Colin Ware, in his book Information Visualization. “The 20 billion or so neurons of the brain devoted to analyzing visual information provide a pattern-finding mechanism that is a fundamental component of our cognitive activity.”

The broad popularity of infographics is pushing higher expectations for data visualization and graphical interaction capabilities in business intelligence and analytics tools. Visualization, coupled with the expanding use of analytics, should be a key concern for data management because its growth affects how data is provisioned for users and the value they gain from it. Good data visualization is critical to making smarter decisions and improving productivity; poorly created visualizations, on the other hand, can mislead users and make it more difficult for them to overcome the daily data onslaught.

Data Visualization and Discovery: Research Insights

I recently finished writing a TDWI Research Report, Data Visualization and Discovery for Better Business Decisions, which will be published in early July and is the subject of an upcoming TDWI Webinar. "Visual discovery," which brings together data visualization with easy-to-use, self-directed data analysis, is, of course, one of the hottest trends in BI and analytics. Users are implementing visual discovery tools to explore data relationships between and across sources, perform what-if analysis, and discover previously unseen patterns and trends in data.

The research report, which includes the results of an extensive survey of the TDWI community, focuses on how organizations can use data visualization, visual analytics, and visual data discovery to improve decision-making, collaboration, and operational execution. Briefly, here are three insights from the report.

Insight #1: Future plans are focused on analytics

Three out of five respondents (60 percent) said that their organizations are currently using data visualization for dashboard displays or snapshot reports, and/or scorecards. Far fewer (33 percent) are currently implementing data visualization for discovery and analysis, and only about a quarter (26 percent) of respondents are doing so for operational alerting. However, larger percentages say they plan to employ data visualization for these latter two types of requirements (45 percent and 39 percent, respectively). The results suggest that although dashboard reporting currently dominates BI visualization, the future focus is on visual analytics and, to a somewhat lesser extent, real-time operational alerting.

Insight #2: Marketing functions are the biggest users of visual data discovery and analysis

TDWI Research finds that the function for which visual data discovery and analysis capabilities are most important is marketing. Both survey analysis and my interviews pointed to the marketing function as the user bastion driving the need for easy, self-directed visual analysis of customer and market data. The most important types of visualizations for executive management remain standard dashboard displays or snapshot reports, as well as scorecards. It appears that visual discovery and analysis has not yet become the mainstream for business functions beyond marketing.

Insight #3: Time series analysis is an important focus for visualization 

A significant percentage of respondents implement visualizations for time-series analysis (39 percent). Users in most organizations need to analyze change over time, and they typically use various line charts for this purpose. Some will apply more exotic visualizations (such as scatter plots) for specialized time-series analysis, including examining correlations over time between multiple data sources. Time-series, pattern, and trend analysis complement predictive analysis. Almost a third (32 percent) of respondents use visualizations for forecasting, modeling, and simulation, and 22 percent are doing so for predictive analysis.
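
As a flavor of the computation behind such charts: before any rendering, time-series analysis often starts with something as simple as a moving average to expose the trend, which then feeds a line chart. A minimal pure-Python sketch (the monthly figures are invented for illustration):

```python
# Smooth a monthly metric with a 3-period moving average to expose the
# trend; the smoothed series is what a line chart would then display.
def moving_average(series, window=3):
    return [sum(series[i - window + 1 : i + 1]) / window
            for i in range(window - 1, len(series))]

monthly_sales = [100, 120, 90, 130, 150, 140, 170]  # invented figures
trend = moving_average(monthly_sales)
print(trend)  # steadily rising: roughly [103.3, 113.3, 123.3, 140.0, 153.3]
```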


The Right Visualizations for the Right Activities

Visualization is exciting, but organizations have to avoid the impulse to clutter users’ screens with nothing more than confusing “eye candy.” One important way to do this is to evaluate closely who needs what kind of visualizations. Not all users may need interactive, self-directed visual discovery and analysis; not all need real-time operational alerting. With adroit use of visualization, however, users will be able to understand and communicate data far more effectively.

©2013 by TDWI (The Data Warehousing Institute), a division of 1105 Media, Inc. Reprinted with permission. Visit for more information.

Read more…

Building blocks of data science

Guest blog post by Vincent Granville

I saw a chart on Twitter about the precursors to data science. I was surprised to see that their graph is missing many links and entities, such as computer science. Data science contributes more to statistics than the other way around. Their view is biased towards making statistics the birthplace of all modern data processing.

I then decided to create my own graph, summarizing my thoughts on data science. My graph is illustrated in figure 1, the other one in figure 2. Which one do you agree most with? Note that in my graph, an arrow from data science to statistics means that data science contributes to statistics. 

Figure 1: The Building Blocks of Data Science, by Vincent Granville

Figure 2: The Building Precursors to Data Science, probably by Diego Kuonen


Read more…

Guest blog post by Renette Youssef

It probably comes as no surprise, but we talk to a lot of data scientists at CrowdFlower. We like learning about the tools they use, the programs that make their lives easier, and how everything works together. Today, we're really pleased to unveil the first of a three-part series about the data science ecosystem. Here it is in infographic form because, let's face it, everybody likes infographics:

Read more…

10 types of regressions. Which one to use?

Originally posted here by Dr Granville. Read comments posted about the original article.

Should you use linear or logistic regression? In what contexts? There are hundreds of types of regressions. Here is an overview to help data scientists and other analytic practitioners decide which regression to use, depending on context. Many of the referenced articles are much better written (fully edited) in my Wiley data science book.

Click here to see source, for this picture

  • Linear regression: Oldest type of regression, designed 250 years ago; computations (on small data) could easily be carried out by a human being, by design. Can be used for interpolation, but is not suitable for predictive analytics; it has many drawbacks when applied to modern data, e.g. sensitivity to both outliers and cross-correlations (in both the variable and observation domains), and it is subject to over-fitting. A better solution is piecewise-linear regression, in particular for time series.
  • Logistic regression: Used extensively in clinical trials, scoring and fraud detection, when the response is binary (chance of succeeding or failing, e.g. for a newly tested drug or a credit card transaction). Suffers from the same drawbacks as linear regression (not robust, model-dependent), and computing the regression coefficients involves a complex, iterative, numerically unstable algorithm. Can be well approximated by linear regression after transforming the response (logit transform). Some versions (Poisson or Cox regression) have been designed for a non-binary response, for categorical data (classification), ordered integer response (age groups), and even continuous response (regression trees).
  • Ridge regression: A more robust version of linear regression, putting constraints on the regression coefficients to make them much more natural, less subject to over-fitting, and easier to interpret. Click here for source code.
  • Lasso regression: Similar to ridge regression, but automatically performs variable reduction (allowing regression coefficients to be zero). 
  • Ecologic regression: Consists of performing one regression per stratum, if your data is segmented into several rather large core strata, groups, or bins. Beware of the curse of big data in this context: if you perform millions of regressions, some will be totally wrong, and the best ones will be overshadowed by noisy ones with great but artificial goodness-of-fit: a big concern if you try to identify extreme events and causal relationships (global warming, rare diseases or extreme flood modeling). Here's a fix to this problem.
  • Regression in unusual spaces: click here for details. Example: to detect whether meteorite fragments come from the same celestial body, or to reverse-engineer the Coca-Cola formula.
  • Logic regression: Used when all variables are binary, typically in scoring algorithms. It is a specialized, more robust form of logistic regression (useful for fraud detection where each variable is a 0/1 rule), where all variables have been binned into binary variables.
  • Bayesian regression: see entry in Wikipedia. It's a kind of penalized likelihood estimator, and thus somewhat similar to ridge regression: more flexible and stable than traditional linear regression. It assumes that you have some prior knowledge about the regression coefficients and the error term, relaxing the assumption that the error must have a normal distribution (the error must still be independent across observations). However, in practice, the prior knowledge is translated into artificial (conjugate) priors - a weakness of this technique.
  • Quantile regression: Used in connection with extreme events, read Common Errors in Statistics page 238 for details.
  • LAD regression: Similar to linear regression, but using absolute values (L1 space) rather than squares (L2 space). More robust, see also our L1 metric to assess goodness-of-fit (better than R^2) and our L1 variance (one version of which is scale-invariant).
  • Jackknife regression: This is a new type of regression, also used as a general clustering and data reduction technique. It solves many of the drawbacks of traditional regression. It provides an approximate, yet very accurate, robust solution to regression problems, and works well with "independent" variables that are correlated and/or non-normal (for instance, data distributed according to a mixture model with several modes). Ideal for black-box predictive algorithms. It approximates linear regression quite well, but it is much more robust, and works when the assumptions of traditional regression (non-correlated variables, normal data, homoscedasticity) are violated.

Note: This Jackknife regression has nothing to do with Bradley Efron's jackknife, bootstrap, and other re-sampling techniques published in 1982.
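For readers who want to experiment, several of the regressions above are available off the shelf. Below is a minimal sketch (synthetic data; the variable names and penalty strengths are illustrative assumptions, and scikit-learn is assumed to be installed) showing how ridge shrinks correlated coefficients while lasso can drive some of them exactly to zero:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
n = 200
# Two highly correlated predictors plus one irrelevant predictor.
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)              # unrelated to the response
X = np.column_stack([x1, x2, x3])
y = 3 * x1 + rng.normal(scale=0.5, size=n)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)   # shrinks coefficients
lasso = Lasso(alpha=0.1).fit(X, y)    # shrinks and can zero them out

print("OLS:  ", np.round(ols.coef_, 2))
print("Ridge:", np.round(ridge.coef_, 2))
print("Lasso:", np.round(lasso.coef_, 2))
```

With correlated predictors like x1 and x2, ordinary least squares tends to split the signal between them unstably; the penalized fits are tamer, and the lasso typically zeroes out the irrelevant x3, performing the variable reduction described above.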

Other Solutions

  • Data reduction can also be performed with our feature selection algorithm.
  • It's always a good idea to blend multiple techniques together to improve your regression, clustering or segmentation algorithms. An example of such blending is hidden decision trees.
  • Categorical independent variables, such as race, are sometimes coded using multiple (binary) dummy variables.

Before working on any project, read our article on the lifecycle of a data science project.

Read more…

Guest blog post by Peter Chen

1. Introduction

In this post, we’ll use a supervised machine learning technique called logistic regression to predict delayed flights. But before we proceed, I would like to offer condolences to the families of the victims of the Germanwings tragedy.

This analysis is conducted using a public data set that can be obtained here:


Note: This is a common data set in the machine learning community for testing algorithms and models, given that it’s publicly available and has sizable data.

In this blog, we will look at a small sample snapshot (2,201 flights in January 2004). In another post, we can explore using Big Data technologies such as Hadoop MapReduce or Spark machine learning libraries to do large-scale predictive analytics and data mining.

Let’s load in our small sample set here and see the first 5 rows of data:

## 1 1455 OH 1455 JFK 184 37987 5935 BWI
## 2 1640 DH 1640 JFK 213 37987 6155 DCA
## 3 1245 DH 1245 LGA 229 37987 7208 IAD
## 4 1715 DH 1709 LGA 229 37987 7215 IAD
## 5 1039 DH 1035 LGA 229 37987 7792 IAD
## Weather DAY_WEEK DAY_OF_MONTH TAIL_NUM Flight.Status
## 1 0 4 1 N940CA ontime
## 2 0 4 1 N405FJ ontime
## 3 0 4 1 N695BR ontime
## 4 0 4 1 N662BR ontime
## 5 0 4 1 N698BR ontime

We see that the following variables are collected:

  • Departure Time (CRS_DEP_TIME)
  • Carrier
  • Destination (DEST)
  • Distance
  • Flight Date (FL_DATE)
  • Flight Number (FL_NUM)
  • Weather (a code of 1 indicates there was a weather-related delay)
  • Day of the Week
  • Day of the Month
  • Tail Number
  • Flight Status (whether it’s ontime or delayed)

The goal here is to identify flights that are likely to be delayed. In the machine learning literature this is called binary classification using supervised learning: we are bucketing flights into delayed or ontime (hence binary classification). (Note: Prediction and classification are two of the main goals of data mining and data science. On a deeper philosophical level, they are two sides of the same coin: to classify things is to predict as well, if you think about it.)

Logistic regression provides us with a probability of belonging to one of the two classes (delayed or ontime). Since probability ranges from 0 to 1, we will use a 0.5 cutoff to determine which bucket to put our probability estimates in. If the probability estimate from the logistic regression is equal to or greater than 0.5, then we assign it to be ontime; else it’s delayed. We’ll explain the theory behind logistic regression in another post.
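The 0.5 cutoff rule can be sketched in a few lines (the probabilities below are hypothetical, not actual model output):

```python
# Hypothetical predicted probabilities of being ontime from a fitted model.
probabilities = [0.92, 0.40, 0.51, 0.08, 0.77]

# Apply the 0.5 cutoff: >= 0.5 -> "ontime", otherwise "delayed".
labels = ["ontime" if p >= 0.5 else "delayed" for p in probabilities]
print(labels)  # ['ontime', 'delayed', 'ontime', 'delayed', 'ontime']
```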

But before we start our modeling exercise, it’s good to take a visual look at what we are trying to predict to see what it looks like. Since we are trying to predict delayed flights with historical data, let’s do a simple histogram plot to see the distribution of flights delayed vs. ontime:

We see that most flights are ontime (81%), as expected. But we need delayed flights in our dataset in order to train the machine to learn from this delayed subset to predict whether future flights will be delayed.

2. Exploratory Data Analysis (EDA):

The next step in predictive analytics is to explore our underlying data. Let’s do a few plots of our explanatory variables to see how they look against delayed flights.

Carriers Distribution in the Data Set

Carrier  Count  Percentage
CO 94 4.3%
DH 551 25%
DL 388 17.6%
MQ 295 13.4%
OH 30 1.4%
RU 408 18.5%
UA 31 1.4%
US 404 18.4%

Please note the following:

  • CO: Continental
  • DH: Atlantic Coast
  • DL: Delta
  • MQ: American Eagle
  • OH: Comair
  • RU: Continental Express
  • UA: United
  • US: US Airways

Let’s examine the Day of the Week effect. We see that Mondays and Sundays have the most delayed flights and Saturdays have the least. Note: 1 is Monday and 7 is Sunday.

Destination airport effect.

Origin airport effect.

3. Data Transformation & Pre-Processing:

One of the main steps in predictive analytics is data transformation. Data is rarely in the form you want it. You often have to apply transformations to get it into the shape you need, whether because the data is dirty, of the wrong type, out of bounds, or for a host of other reasons.

The first transformation we need to do is to convert the categorical variables into dummy variables.

The four categorical variables of interest are: 1) Carrier, 2) Destination (airport code), 3) Origin (airport code), and 4) Day of the Week. For simplicity of model building, we’ll NOT use Day of the Month, because of the combinatorial explosion in the number of dummy variables. The reader is free to do this as an exercise on his/her own. :)

Here are the first five rows of the categorical-to-dummy-variable transformation. There’s a handy function in R called model.matrix that helps us with that.

flights.dummy <- model.matrix(~CARRIER+DEST+ORIGIN+DAY_WEEK, data=flights)  # build the design matrix
flights.dummy <- flights.dummy[,-1]  # drop the intercept column
head(flights.dummy, 5)

## 1 0 0 0 1 0 0 0
## 2 1 0 0 0 0 0 0
## 3 1 0 0 0 0 0 0
## 4 1 0 0 0 0 0 0
## 5 1 0 0 0 0 0 0
## 1 1 0 0 0 0 0 1
## 2 1 0 1 0 0 0 1
## 3 0 1 0 1 0 0 1
## 4 0 1 0 1 0 0 1
## 5 0 1 0 1 0 0 1
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0

Then we need to cut/segment DEP_TIME into sensible buckets. In this case, I’ve divided departure times into hourly buckets. Then we need to convert those buckets into dummy variables.

## HourBlockDeptTimeHourBlock1 HourBlockDeptTimeHourBlock10
## 1 0 0
## 2 0 0
## 3 0 0
## HourBlockDeptTimeHourBlock11 HourBlockDeptTimeHourBlock12
## 1 0 0
## 2 0 0
## 3 0 1
## HourBlockDeptTimeHourBlock13 HourBlockDeptTimeHourBlock14
## 1 0 1
## 2 0 0
## 3 0 0
## HourBlockDeptTimeHourBlock15 HourBlockDeptTimeHourBlock16
## 1 0 0
## 2 0 1
## 3 0 0
## HourBlockDeptTimeHourBlock17 HourBlockDeptTimeHourBlock18
## 1 0 0
## 2 0 0
## 3 0 0
## HourBlockDeptTimeHourBlock19 HourBlockDeptTimeHourBlock20
## 1 0 0
## 2 0 0
## 3 0 0
## HourBlockDeptTimeHourBlock21 HourBlockDeptTimeHourBlock22
## 1 0 0
## 2 0 0
## 3 0 0
## HourBlockDeptTimeHourBlock23 HourBlockDeptTimeHourBlock5
## 1 0 0
## 2 0 0
## 3 0 0
## HourBlockDeptTimeHourBlock6 HourBlockDeptTimeHourBlock7
## 1 0 0
## 2 0 0
## 3 0 0
## HourBlockDeptTimeHourBlock8 HourBlockDeptTimeHourBlock9
## 1 0 0
## 2 0 0
## 3 0 0
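The post uses R's model.matrix for this; as a rough equivalent sketch in Python/pandas (the toy data frame and column names below are illustrative assumptions, not the actual data set), the hour bucketing and dummy coding look like this:

```python
import pandas as pd

# A tiny hypothetical sample mirroring the post's columns.
flights = pd.DataFrame({
    "CARRIER": ["OH", "DH", "DH"],
    "DEST": ["JFK", "JFK", "LGA"],
    "DEP_TIME": [1455, 1640, 1245],  # hhmm as an integer
})

# Bucket departure time by hour, then expand categoricals into 0/1 dummies.
flights["DeptHourBlock"] = flights["DEP_TIME"] // 100
dummies = pd.get_dummies(flights[["CARRIER", "DEST", "DeptHourBlock"]],
                         columns=["CARRIER", "DEST", "DeptHourBlock"],
                         drop_first=True)  # drop one level, as model.matrix does
print(dummies)
```

Dropping the first level of each categorical (drop_first=True) mirrors model.matrix's removal of the reference level, avoiding perfectly collinear dummy columns.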

Let’s join all the variables into one big data frame so that later on we can feed into our logistic regression.

## 1 0 0 0 1 0 0 0
## 2 1 0 0 0 0 0 0
## 1 1 0 0 0 0 0 1
## 2 1 0 1 0 0 0 1
## DAY_WEEK5 DAY_WEEK6 DAY_WEEK7 HourBlockDeptTimeHourBlock1
## 1 0 0 0 0
## 2 0 0 0 0
## HourBlockDeptTimeHourBlock10 HourBlockDeptTimeHourBlock11
## 1 0 0
## 2 0 0
## HourBlockDeptTimeHourBlock12 HourBlockDeptTimeHourBlock13
## 1 0 0
## 2 0 0
## HourBlockDeptTimeHourBlock14 HourBlockDeptTimeHourBlock15
## 1 1 0
## 2 0 0
## HourBlockDeptTimeHourBlock16 HourBlockDeptTimeHourBlock17
## 1 0 0
## 2 1 0
## HourBlockDeptTimeHourBlock18 HourBlockDeptTimeHourBlock19
## 1 0 0
## 2 0 0
## HourBlockDeptTimeHourBlock20 HourBlockDeptTimeHourBlock21
## 1 0 0
## 2 0 0
## HourBlockDeptTimeHourBlock22 HourBlockDeptTimeHourBlock23
## 1 0 0
## 2 0 0
## HourBlockDeptTimeHourBlock5 HourBlockDeptTimeHourBlock6
## 1 0 0
## 2 0 0
## HourBlockDeptTimeHourBlock7 HourBlockDeptTimeHourBlock8
## 1 0 0
## 2 0 0
## HourBlockDeptTimeHourBlock9 Weather FlightStatus
## 1 0 0 ontime
## 2 0 0 ontime

4. Model Building: Logistic Regression

Now, it’s generally NOT a good idea to use your ENTIRE data sample to fit the model. What we want to do is train the model on a sample of the data, then see how it performs outside of the training sample. Splitting our data set into training and test sets lets us evaluate the performance of our models on unseen data. Using the entire data set to build a model and then using that same data set to evaluate how well the model does is a bit of cheating, or at least careless analytics.

We use a RANDOM sample of 60% of the data set as the training set. Let’s take a peek at the first 5 rows of the training set.

## 2 1 0 0 0 0 0 0
## 3 1 0 0 0 0 0 0
## 4 1 0 0 0 0 0 0
## 7 1 0 0 0 0 0 0
## 8 1 0 0 0 0 0 0
## 2 1 0 1 0 0 0 1
## 3 0 1 0 1 0 0 1
## 4 0 1 0 1 0 0 1
## 7 1 0 0 1 0 0 1
## 8 1 0 0 1 0 0 1
## DAY_WEEK5 DAY_WEEK6 DAY_WEEK7 HourBlockDeptTimeHourBlock1
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 7 0 0 0 0
## 8 0 0 0 0
## HourBlockDeptTimeHourBlock10 HourBlockDeptTimeHourBlock11
## 2 0 0
## 3 0 0
## 4 0 0
## 7 0 0
## 8 0 0
## HourBlockDeptTimeHourBlock12 HourBlockDeptTimeHourBlock13
## 2 0 0
## 3 1 0
## 4 0 0
## 7 1 0
## 8 0 0
## HourBlockDeptTimeHourBlock14 HourBlockDeptTimeHourBlock15
## 2 0 0
## 3 0 0
## 4 0 0
## 7 0 0
## 8 0 0
## HourBlockDeptTimeHourBlock16 HourBlockDeptTimeHourBlock17
## 2 1 0
## 3 0 0
## 4 0 1
## 7 0 0
## 8 1 0
## HourBlockDeptTimeHourBlock18 HourBlockDeptTimeHourBlock19
## 2 0 0
## 3 0 0
## 4 0 0
## 7 0 0
## 8 0 0
## HourBlockDeptTimeHourBlock20 HourBlockDeptTimeHourBlock21
## 2 0 0
## 3 0 0
## 4 0 0
## 7 0 0
## 8 0 0
## HourBlockDeptTimeHourBlock22 HourBlockDeptTimeHourBlock23
## 2 0 0
## 3 0 0
## 4 0 0
## 7 0 0
## 8 0 0
## HourBlockDeptTimeHourBlock5 HourBlockDeptTimeHourBlock6
## 2 0 0
## 3 0 0
## 4 0 0
## 7 0 0
## 8 0 0
## HourBlockDeptTimeHourBlock7 HourBlockDeptTimeHourBlock8
## 2 0 0
## 3 0 0
## 4 0 0
## 7 0 0
## 8 0 0
## HourBlockDeptTimeHourBlock9 Weather FlightStatus
## 2 0 0 ontime
## 3 0 0 ontime
## 4 0 0 ontime
## 7 0 0 ontime
## 8 0 0 ontime
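The 60/40 random split described above can be sketched as follows (pure standard library; the seed is an arbitrary illustration):

```python
import random

random.seed(42)
rows = list(range(2201))           # indices of the 2,201 flights
random.shuffle(rows)

cut = int(0.6 * len(rows))         # 60% for training
train_idx, test_idx = rows[:cut], rows[cut:]
print(len(train_idx), len(test_idx))  # 1320 881
```

Note that 60% of the 2,201 flights gives 1,320 training rows and 881 test rows, consistent with the degrees of freedom and confusion-matrix totals shown in the R output.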

5. Results with Training Data

Now, let’s feed the training data (60% of our total data set) into our logistic regression model:

## Call:
## glm(formula = FlightStatus ~ ., family = binomial, data = train)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9293 0.2287 0.4632 0.6330 1.4940
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -18.00948 2399.54481 -0.008 0.994012
## CARRIERDH 1.22308 0.59115 2.069 0.038549 *
## CARRIERDL 1.85836 0.54340 3.420 0.000627 ***
## CARRIERMQ 0.62141 0.51715 1.202 0.229517
## CARRIEROH 2.21278 0.95203 2.324 0.020111 *
## CARRIERRU 1.05671 0.44686 2.365 0.018043 *
## CARRIERUA 1.59265 0.98043 1.624 0.104283
## CARRIERUS 2.10368 0.55575 3.785 0.000154 ***
## DESTJFK -0.02558 0.36198 -0.071 0.943666
## DESTLGA -0.28915 0.36769 -0.786 0.431636
## ORIGINDCA 1.48927 0.42527 3.502 0.000462 ***
## ORIGINIAD 0.50949 0.40285 1.265 0.205968
## DAY_WEEK2 0.27817 0.28915 0.962 0.336039
## DAY_WEEK3 0.43535 0.28308 1.538 0.124070
## DAY_WEEK4 0.48144 0.27512 1.750 0.080136 .
## DAY_WEEK5 0.43374 0.27570 1.573 0.115672
## DAY_WEEK6 1.28499 0.37066 3.467 0.000527 ***
## DAY_WEEK7 -0.02538 0.28445 -0.089 0.928917
## HourBlockDeptTimeHourBlock1 NA NA NA NA
## HourBlockDeptTimeHourBlock10 18.07704 2399.54479 0.008 0.993989
## HourBlockDeptTimeHourBlock11 17.45879 2399.54487 0.007 0.994195
## HourBlockDeptTimeHourBlock12 18.09730 2399.54478 0.008 0.993982
## HourBlockDeptTimeHourBlock13 16.47483 2399.54476 0.007 0.994522
## HourBlockDeptTimeHourBlock14 17.78484 2399.54477 0.007 0.994086
## HourBlockDeptTimeHourBlock15 15.96003 2399.54477 0.007 0.994693
## HourBlockDeptTimeHourBlock16 16.99874 2399.54476 0.007 0.994348
## HourBlockDeptTimeHourBlock17 16.99070 2399.54476 0.007 0.994350
## HourBlockDeptTimeHourBlock18 16.69408 2399.54477 0.007 0.994449
## HourBlockDeptTimeHourBlock19 15.49419 2399.54478 0.006 0.994848
## HourBlockDeptTimeHourBlock20 16.30981 2399.54479 0.007 0.994577
## HourBlockDeptTimeHourBlock21 17.16271 2399.54477 0.007 0.994293
## HourBlockDeptTimeHourBlock22 -0.95332 2538.73457 0.000 0.999700
## HourBlockDeptTimeHourBlock23 -0.51317 2680.37215 0.000 0.999847
## HourBlockDeptTimeHourBlock5 17.46843 2399.54490 0.007 0.994192
## HourBlockDeptTimeHourBlock6 17.90277 2399.54478 0.007 0.994047
## HourBlockDeptTimeHourBlock7 17.51392 2399.54479 0.007 0.994176
## HourBlockDeptTimeHourBlock8 17.94252 2399.54478 0.007 0.994034
## HourBlockDeptTimeHourBlock9 16.07116 2399.54479 0.007 0.994656
## Weather -17.74165 493.32979 -0.036 0.971312
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Dispersion parameter for binomial family taken to be 1)
## Null deviance: 1312.7 on 1319 degrees of freedom
## Residual deviance: 1034.1 on 1282 degrees of freedom
## AIC: 1110.1
## Number of Fisher Scoring iterations: 15

The following variables are significant in predicting flight delays according to the model output above:
  • DAY_WEEK6 (Saturday)
  • Origin airport is DCA (Reagan National)
  • Carrier is US Airways
  • Carrier is Delta
  • Carrier is Comair
  • Carrier is Continental Express

Interestingly, the Hour of the Day has no statistical significance in predicting flight delays.

6. Model Evaluation: Logistic Regression

The real test of a good model is to test the model with data that it has not fitted. Here’s where the rubber meets the road. We apply our model to unseen data to see how it performs.

6.1 Prediction using out-of-sample data.

Let’s feed the test data (unseen) to our logistic regression model.

Confusion Matrix

We use the confusion matrix to see the performance of the binary classifier, which is how logistic regression is used in this example. Note that logistic regression can also be used for more than binary classification (multi-class problems).

Please check out this nice Wikipedia explanation of the Confusion Matrix.

The diagonals of the confusion matrix are the true positives and true negatives. The model correctly predicted 42 delayed flights and 706 ontime flights.

On the other hand, it predicted 125 flights to be delayed that were actually ontime, and 8 to be ontime that were actually delayed.

There are three metrics that people look at:

  1. Sensitivity (true positive rate, or recall): True Positives / (True Positives + False Negatives)
  2. Specificity (true negative rate): True Negatives / (False Positives + True Negatives)
  3. Accuracy: (True Positives + True Negatives) / Total
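Plugging in the confusion-matrix counts reported just below (42 true positives, 125 false positives, 8 false negatives, 706 true negatives) reproduces the figures in the R output:

```python
# Confusion-matrix counts from the R output below
# (rows: predicted, columns: actual; "delayed" is the positive class).
tp = 42    # predicted delayed, actually delayed
fp = 125   # predicted delayed, actually ontime
fn = 8     # predicted ontime, actually delayed
tn = 706   # predicted ontime, actually ontime

sensitivity = tp / (tp + fn)            # recall on the "delayed" class
specificity = tn / (fp + tn)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(round(sensitivity, 4), round(specificity, 5), round(accuracy, 3))
# 0.84 0.84958 0.849
```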

## Warning: package 'ROCR' was built under R version 3.1.3
## Warning in predict.lm(object, newdata,, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
## Confusion Matrix and Statistics
## Reference
## Prediction delayed ontime
## delayed 42 125
## ontime 8 706
## Accuracy : 0.849
## 95% CI : (0.8237, 0.872)
## No Information Rate : 0.9432
## P-Value [Acc > NIR] : 1
## Kappa : 0.3284
## Mcnemar's Test P-Value : <2e-16
## Sensitivity : 0.84000
## Specificity : 0.84958
## Pos Pred Value : 0.25150
## Neg Pred Value : 0.98880
## Prevalence : 0.05675
## Detection Rate : 0.04767
## Detection Prevalence : 0.18956
## Balanced Accuracy : 0.84479
## 'Positive' Class : delayed

Lift Chart (ROC - Receiver Operating Characteristic curve):

This is a graphical representation of the relationship between the sensitivity and the false positive rate. Sensitivity is the proportion of correct positive results among all positive samples available. The false positive rate, on the other hand, is the proportion of incorrect positive results among all negative samples available.

The BEST possible prediction model would yield a point in the upper left corner (0,1). This would represent perfect classification with no false negatives (perfect sensitivity) and no false positives (perfect specificity). So, the higher the ROC curve or lift, the better the model; the closer the curve is to the diagonal line, the worse the model, approaching random guessing.


This represents the overall accuracy of the classifier, which can be misleading if there are many more negatives than positives.
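To make the sensitivity/false-positive-rate trade-off concrete, here is a small sketch that sweeps a cutoff over hypothetical "ontime" scores (the scores and labels are made up for illustration, not taken from the model):

```python
# Hypothetical scores (probability of "ontime") and true labels;
# the "positive" class here is delayed, as in the confusion matrix above.
scores = [0.9, 0.8, 0.7, 0.55, 0.4, 0.3, 0.2, 0.1]
delayed = [0,   0,   1,   0,    1,   0,   1,   1  ]  # 1 = actually delayed

def roc_point(cutoff):
    # Flag a flight as delayed when its "ontime" score falls below the cutoff.
    pred = [1 if s < cutoff else 0 for s in scores]
    tp = sum(p and d for p, d in zip(pred, delayed))
    fp = sum(p and not d for p, d in zip(pred, delayed))
    pos = sum(delayed)
    neg = len(delayed) - pos
    return tp / pos, fp / neg   # (sensitivity, false positive rate)

for c in (0.25, 0.5, 0.75, 1.0):
    tpr, fpr = roc_point(c)
    print(f"cutoff={c:.2f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```

Each cutoff produces one (FPR, TPR) point; sweeping the cutoff from 0 to 1 traces out the ROC curve described above.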

7. Conclusion

Hope you enjoyed this and are excited about applying predictive analytics models to your problem space.

In follow-on blogs, I’ll explain in further detail the theories behind these methods and the differences and similarities between them.

Originally posted here.

Read more…

Space is limited.

Reserve your Webinar seat now 

Please join us on June 9th, 2015 at 9am PDT for our latest Data Science Central Webinar Event: Allrecipes: Growing the World's Largest Digital Food Brand Through Data Visualization sponsored by Tableau Software.

Allrecipes, the world's largest digital food brand, receives more than a billion visits annually from home cooks around the world, across PCs, smartphones and tablets. Find out how Allrecipes is leveraging data visualization to bring real-time digital food behavior insights to their internal teams, the media, technology partners and the world's largest consumer packaged goods (CPG) brands in actionable, timely, and meaningful ways. The Allrecipes team will discuss their Tableau deployment, growing adoption, and uses for marketing, public relations, sales and improved customer experience.

Grace Preyapongpisan, Vice President, Business Intelligence, Allrecipes

Hosted by
Tim Matteson, Co-founder, Data Science Central

Title:  Allrecipes: Growing the World's Largest Digital Food Brand Through Data Visualization
Date:  Tuesday, June 9th, 2015
Time:  9:00 AM - 10:00 AM PDT
Again, space is limited, so please register early:
Reserve your Webinar seat now
After registering you will receive a confirmation email containing information about joining the Webinar.
Read more…

Guest blog post by Mirko Krivanek

Leaflet is a modern open-source JavaScript library for mobile-friendly interactive maps. It is developed by Vladimir Agafonkin with a team of dedicated contributors. Weighing just about 33 KB of JS, it has all the features most developers ever need for online maps.

Leaflet is designed with simplicity, performance and usability in mind. It works efficiently across all major desktop and mobile platforms out of the box, taking advantage of HTML5 and CSS3 on modern browsers while still being accessible on older ones. It can be extended with a huge number of plugins, has a beautiful, easy-to-use and well-documented API, and a simple, readable source code that is a joy to contribute to.

In this basic example, we create a map with tiles of our choice, add a marker and bind a popup with some text to it:

For an interactive map and source code in text format, click here.

Learn more with the quick start guide, check out other tutorials, or head straight to the API documentation
If you have any questions, take a look at the FAQ first.

Related Articles

Read more…

Guest blog post by BDV_Works

Most current market solutions are based on client-side web APIs, which makes it very hard to map and plot more than 50,000 location-based data points: your browser will time out or fail to respond...

Our BDV Engine® is a server-side software visualization engine that provides a method for organizations to analyze and visualize Big Data. The software has capabilities for clustering, rendering, and displaying Big Data on a map.

It is a copyrighted and trademarked software solution that brings unmatched performance, capacity and efficiency to any application interested in using or understanding data in a geographic context. By supporting your enterprise-wide data sources and workflows, it offers a superior means of handling Big Data (up to 200 million records), supports rapid response, and provides actionable information. A single server powered by BDV Engine® can outperform the best solutions on the market by at least 10 times in terms of processing power, capacity and features.

BDV Engine® allows sophisticated geographic/locational queries and analysis of 100 to 200 million records in less than 15 seconds. For data sets of fewer than 10 million records, the response to your web browser or mobile device can be faster than 1 second.

The engine brings true rapid response and the next generation of dynamic mapping/clustering, business intelligence and visualization techniques to the corporate world.

Key Features:

  • Publishes results in map or table formats to your web browser, mobile device or dashboard. 

  • Integrates seamlessly into any enterprise business platform, using your current system architectures (databases, BI reporting cubes) without affecting security, deployment capabilities or  scalability.

  • Output, in JSON format, is easily consumed and used by web applications and developers, allowing for fast and efficient incorporation into existing applications and databases.

  • Supports Google Map APIs, Bing Map APIs, Esri Map APIs, Nokia (Here) MAP APIs, Oracle Map/Web APIs, Open Source APIs, C#, Java and PHP.

Please check out more details on our site.

Read more…

Guest blog post by Nilesh Jethwa

How has the interest in Big Data, Hadoop, Business Intelligence, Analytics and dashboards changed over the years?

One easy way to gauge interest is to measure how much news is generated for a related term, and Google Trends allows you to do that very easily.

Plugging all of the above terms into Google Trends and performing further analysis leads to the following visualizations.

Aggregating the results by year

Before Big Data and Hadoop came into the picture, the term “Analytics” held steady at a level closer to dashboards, but now the trend for Analytics seems to be following Big Data and Hadoop.

Let us take a deeper look into each week since 2004

Click here to view the full interactive Visualizations

Read more…

The death of the statistician

Guest blog post by Vincent Granville

And the rise of the data scientist. These pictures speak better than words. They represent keyword popularity according to Google. These numbers and charts are available on Google.

Other public data sources include Indeed and LinkedIn (number of applicants per job ad), though they tend to be more job market related. 

Feel free to add your charts, for keywords such as newsql, map reduce, R, graph database, nosql, predictive analytics, machine learning, statistics etc.

Related articles

Read more…

Guest blog post by Vincent Granville

Here I propose a solution that can be automated and does not require visual inspection by a human being. The solution can thus be applied to billions of clustering problems, automated and processed in batch mode.

Note that the concept of cluster is a fuzzy one. How do you define a cluster? How many clusters in the chart below?


Nevertheless, in many applications, there's a clear optimum number of clusters. The methodology described here will solve all easy and less easy cases, and will provide a "don't know" answer to cases that are ambiguous.


  • Create a 2-dimensional table with the following rows: number of clusters in row #1, and percentage of variance explained by the clusters in row #2.
  • Compute the 3rd differences of row #2.
  • The maximum of the 3rd differences (if much higher than the other values) determines the number of clusters.

This is based on the fact that the piece-wise linear plot of the number of clusters versus the percentage of variance explained by the clusters is a convex function with an elbow point, see chart below. The elbow point determines the optimum number of clusters. If the piece-wise linear function is approximated by a smooth curve, the optimum would be the point where the 4th derivative of the approximating smooth curve vanishes. This methodology is simply an application of this "elbow detection" technique in a discrete framework (the number of clusters being a discrete number).

 1    2    3    4    5    6    7    8     ==> number of clusters

 40   65   80   85   88   90   91   91    ==> percentage of variance explained by clusters

    25   15    5    3    2    1    0      ==> 1st difference

      -10  -10   -2   -1   -1   -1        ==> 2nd difference

            0    8    1    0    0         ==> 3rd difference

The optimum number of clusters in this example is 4, corresponding to the maximum value (8) in the 3rd differences.


If you already have a strong minimum in the 2nd differences (not the case here), you don't need to go to the 3rd differences: stop at level 2.
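The third-difference rule can be sketched directly on the example table above:

```python
def diffs(xs):
    # Successive differences of a sequence.
    return [b - a for a, b in zip(xs, xs[1:])]

# Percentage of variance explained for k = 1..8 clusters (from the table).
explained = [40, 65, 80, 85, 88, 90, 91, 91]

d3 = diffs(diffs(diffs(explained)))   # third differences
print(d3)  # [0, 8, 1, 0, 0]

# The i-th third difference involves cluster counts i+1 .. i+4; taking the
# elbow at the maximum maps index i back to k = i + 3 in this setup.
optimal_k = d3.index(max(d3)) + 3
print(optimal_k)  # 4
```

The maximum third difference (8) stands well above the other values, so the method confidently returns 4 clusters; when no value dominates, the method should return "don't know" instead.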

Read more…
