

Google is a prolific contributor to open source. Here is a list of four open-source and cloud projects from Google focusing on analytics, machine learning, data cleansing, and visualization.

TensorFlow

TensorFlow™ is an open source software library for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them. The flexible architecture allows you to deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API. TensorFlow was originally developed by researchers and engineers working on the Google Brain Team within Google's Machine Intelligence research organization for the purposes of conducting machine learning and deep neural networks research, but the system is general enough to be applicable in a wide variety of other domains as well.
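As a minimal sketch of this data-flow-graph model, here is a toy example using the original graph-and-session TensorFlow API (the tensors and values are arbitrary, chosen only for illustration): nodes are operations such as matmul, and the edges between them carry tensors.

import tensorflow as tf

# Nodes are operations; the edges between them carry tensors.
a = tf.constant([[1.0, 2.0]])      # a 1x2 tensor
b = tf.constant([[3.0], [4.0]])    # a 2x1 tensor
product = tf.matmul(a, b)          # a matrix-multiplication node in the graph

with tf.Session() as sess:         # the same graph can be deployed to CPUs or GPUs
    print(sess.run(product))       # [[11.]]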

OpenRefine

OpenRefine (formerly Google Refine) is a powerful tool for working with messy data: cleaning it, transforming it from one format into another, and extending it with web services and external data. Please note that since October 2nd, 2012, Google has not been actively supporting this project, which has been rebranded to OpenRefine. Project development, documentation, and promotion are now fully supported by volunteers.

Google Charts

Google Charts provides a perfect way to visualize data on your website. From simple line charts to complex hierarchical tree maps, the chart gallery provides a large number of ready-to-use chart types.

The most common way to use Google Charts is with simple JavaScript that you embed in your web page. You load some Google Chart libraries, list the data to be charted, select options to customize your chart, and finally create a chart object with an id that you choose. Then, later in the web page, you create a <div> with that id to display the Google Chart.

Automatic Statistician

Making sense of data is one of the great challenges of the information age we live in. While it is becoming easier to collect and store all kinds of data, from personal medical data, to scientific data, to public data, and commercial data, there are relatively few people trained in the statistical and machine learning methods required to test hypotheses, make predictions, and otherwise create interpretable knowledge from this data. The Automatic Statistician project aims to build an artificial intelligence for data science, helping people make sense of their data.


This article is compiled by Jogmon.

Originally posted on Data Science Central

Read more…

Guest blog and great infographic from our friends at DataCamp.com

Nowadays, the data science field is hot, and it is unlikely that this will change in the near future. While a data-driven approach is finding its way into all facets of business, companies are fiercely fighting for the best data analytics skills available in the market, and salaries for data science roles are going into overdrive.

 Companies’ increased focus on acquiring data science talent goes hand in hand with the creation of a whole new set of data science roles and titles. Sometimes new roles and titles are added to reflect changing needs; other times they are probably created as a creative way to differentiate from fellow recruiters. Either way, it’s hard to get a general understanding of the different job roles, and it gets even harder when you’re looking to figure out what job role could best fit your personal and professional ambitions.

The Data Science Industry: Who Does What

DataCamp took a look at this avalanche of data science job postings in an attempt to unravel these cool-sounding and playful job titles into a comparison of different data science related careers. We summarized the results in our latest infographic “The Data Science Industry: Who Does What”:

In this infographic we compare the roles of data scientists, data analysts, data architects, data engineers, statisticians and many more. We have a look at their roles within companies and the data science process, what technologies they have mastered, and what the typical skillset and mindset is for each role. Furthermore, we look at the top employers that are currently hiring these different data science roles and how the average national salaries of these roles map out against each other.

Hopefully this infographic will help you to better understand the different job roles that are available to data passionate professionals.

The original blog and infographic can be seen here.

Originally posted on Data Science Central

Read more…

Guest blog shared by Stefan Kingham at Medigo.com

This infographic displays data from The Economist Intelligence Unit’s ‘Healthcare Outcomes Index 2014’, which took into account a number of diverse and complex factors to produce a ranking of the world’s best-performing countries in healthcare outcomes.

In order to produce a rounded set of outcome rankings, the EIU used basic factors like life expectancy and infant mortality rates alongside weighted factors such as Disability-Adjusted Life Years (DALYs) and Health-Adjusted Life Expectancy (HALEs), whilst also taking ageing populations and adult mortality rates into consideration.

The EIU also produced an overview of how much countries spend each year on healthcare per capita. This spending ranking is based on data from the World Health Organization (WHO).

By plotting the EIU’s outcome rankings against spending rankings for each country, we are able to develop a global overview of how effectively countries use their healthcare budgets.

See the original post here.

Originally posted on Data Science Central

Read more…

DataViz for Cavemen

The late seventies are considered prehistoric times by most data scientists. Yet it was the beginning of a new era, with people getting their first personal computers, or at least programmable calculators like the one pictured below. The operating system was called DOS, and later became MS-DOS, for Microsoft Disk Operating System. You could use your TV set as a monitor, and tapes and a tape recorder (then later floppy disks) to record data. Memory was limited to 64KB. Regarding the HP-41 model below, the advertising claimed that with some extra modules you could write up to 2,000 lines of code and save them permanently. I indeed started my career with the very model featured below, offered as a birthday present back in high school (I even inverted matrices with it). Math teachers were afraid of these machines; I believe they were banned from schools at some point.

One of the interesting features of these early times was that there was no real graphic device, at least not for personal use (publishers, of course, had access to expensive plotting machines back then). So the trick was to produce graphs and images using only ASCII characters. Typical monitors could display 25 lines of 40 characters each, in a fixed-width font (Courier). More advanced systems would let you switch between two virtual screens, extending the length of a line to 80 characters.

Here are some of the marvels that you could produce back then - now this is considered an art. Has anyone ever made a video using just ASCII characters like in this picture? If anything, it shows how big data is shallow: a 1024 x 1024 image (or a video made up of hundreds of such frames) can be compressed by a factor of 2,000 or more, and yet it still conveys pretty much all the useful information available in the big, original version. This brings up another question: could this technique be used for face recognition?
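To make the idea concrete, here is a minimal Python sketch (a rough illustration, not the method used for the picture below) that downsamples an image and maps pixel brightness to ASCII characters using the Pillow library; the file name and character ramp are arbitrary choices.

from PIL import Image

CHARS = "@%#*+=-:. "  # dark-to-light character ramp

def image_to_ascii(path, width=80):
    img = Image.open(path).convert("L")        # convert to grayscale
    w, h = img.size
    height = int(h * width / w * 0.5)          # 0.5 corrects for tall character cells
    img = img.resize((width, height))
    pixels = list(img.getdata())
    lines = []
    for row in range(height):
        row_pixels = pixels[row * width:(row + 1) * width]
        lines.append("".join(CHARS[p * (len(CHARS) - 1) // 255] for p in row_pixels))
    return "\n".join(lines)

print(image_to_ascii("portrait.png"))  # hypothetical file name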

This is supposed to be Obama - see details

Click here for details or to download this text file (the image)!

Originally posted on Data Science Central


Read more…

Guest blog by Fabio Souto

A curated list of awesome data visualization frameworks, libraries and software. Inspired by awesome-python.

Table of contents

JavaScript tools

Charting libraries

  • C3 - a D3-based reusable chart library.
  • Chart.js - Charts with the canvas tag.
  • Charted - A charting tool that produces automatic, shareable charts from any data file.
  • Chartist.js - Responsive charts with great browser compatibility.
  • Dimple - An object-oriented API for business analytics.
  • Dygraphs - Interactive line charts library that works with huge datasets.
  • Echarts - Highly customizable and interactive charts ready for big datasets.
  • Epoch - Perfect to create real-time charts.
  • Highcharts - A charting library based on SVG and VML rendering. Free (CC BY-NC) for non-profit projects.
  • MetricsGraphics.js - Optimized for time-series data.
  • Morris.js - Pretty time-series line graphs.
  • NVD3 - A reusable charting library written in d3.js.
  • Peity - A library to create small inline svg charts.
  • TechanJS - Stock and financial charts.

Charting libraries for graphs

  • Cola.js - A tool to create diagrams using constraint-based optimization techniques. Works with d3 and svg.js.
  • Cytoscape.js - JavaScript library for graph drawing maintained by Cytoscape core developers.
  • Linkurious - A toolkit to speed up the development of graph visualization and interaction applications. Based on Sigma.js.
  • Sigma.js - JavaScript library dedicated to graph drawing.
  • VivaGraph - Graph drawing library for JavaScript.

Maps

  • CartoDB - CartoDB is an open source tool that allows for the storage and visualization of geospatial data on the web.
  • Cesium - WebGL virtual globe and map engine.
  • Leaflet - JavaScript library for mobile-friendly interactive maps.
  • Leaflet Data Visualization Framework - A framework designed to simplify data visualization and thematic mapping using Leaflet.
  • Mapsense.js - Combines d3.js with tile maps.
  • Modest Maps - BSD-licensed display and interaction library for tile-based maps in Javascript.

d3

dc.js

dc.js is a multi-dimensional charting library built to work natively with crossfilter.

Misc

  • Chroma.js - A small library for color manipulation.
  • Piecon - Pie charts in your favicon.
  • Recline.js - Simple but powerful library for building data applications in pure JavaScript and HTML.
  • Textures.js - A library to create SVG patterns.
  • Timeline.js - Create interactive timelines.
  • Vega - Vega is a visualization grammar, a declarative format for creating, saving, and sharing interactive visualization designs.
  • Vis.js - A dynamic visualization library including timeline, networks and graphs (2D and 3D).

Android tools

  • HelloCharts - Charting library for Android compatible with API 8+.
  • MPAndroidChart - A powerful & easy to use chart library.

C++ tools

Golang tools

  • Charts for Go - Basic charts in Go. Can render to ASCII, SVG and images.
  • svgo - Go Language Library for SVG generation.

iOS tools

  • JBChartView - Charting library for both line and bar graphs.
  • PNChart - A simple and beautiful chart lib used in Piner and CoinsMan.
  • ios-charts - iOS port of MPAndroidChart. You can create charts for both platforms with very similar code.

Python tools

  • bokeh - Interactive Web Plotting for Python.
  • matplotlib - A python 2D plotting library.
  • pygal - A dynamic SVG charting library.
  • seaborn - A library for making attractive and informative statistical graphics.
  • toyplot - The kid-sized plotting toolkit for Python with grownup-sized goals.

R tools

  • ggplot2 - A plotting system based on the grammar of graphics.
  • rbokeh - R Interface to Bokeh.
  • rgl - 3D visualization using OpenGL.

Ruby tools

  • Chartkick - Create beautiful JavaScript charts with one line of Ruby.

Other tools

Tools that are not tied to a particular platform or language.

  • Lightning - A data-visualization server providing API-based access to reproducible, web-based, interactive visualizations.
  • RAW - Create web visualizations from CSV or Excel files.
  • Spark - Sparklines for the shell. It has several implementations in different languages.
  • Periscope - Create charts directly from SQL queries.

Resources

Books

Twitter accounts

Websites

Contributing

  • Please check for duplicates first.
  • Keep descriptions short, simple and unbiased.
  • Please make an individual commit for each suggestion.
  • Add a new category if needed.

Thanks for your suggestions!

License

CC0

To the extent possible under law, Fabio Souto has waived all copyright and related or neighboring rights to this work.

Originally posted on Data Science Central

Read more…

Guest blog post by Laetitia Van Cauwenberge

Your best references to do your job or get started in data science.

  1. Machine Learning on GitHub
  2. Supervised Learning on GitHub
  3. Cheat Sheet: Data Visualization with R
  4. Cheat Sheet: Data Visualisation in Python
  5. scikit-learn Algorithm Cheat Sheet
  6. Vincent Granville's Data Science Cheat Sheet - Basic
  7. Vincent Granville's Data Science Cheat Sheet - Advanced
  8. Cheat Sheet – 10 Machine Learning Algorithms & R Commands
  9. Microsoft Azure Machine Learning : Algorithm Cheat Sheet
  10. Cheat Sheet – Algorithm for Supervised and Unsupervised Learning
  11. Machine Learning and Predictive Analytics, on Dzone
  12. ML Algorithm Cheat Sheet by Laurence Diane
  13. CheatSheet: Data Exploration using Pandas in Python
  14. 24 Data Science, R, Python, Excel, and Machine Learning Cheat Sheets.

Click here for picture source


Read more…

If you are a data scientist or analyst looking to implement Tableau or join a Tableau implementation company, here are 8 companies you can evaluate.

Neos.hr

Tableau helps anyone quickly analyze, visualize and share information. Rapid-fire business intelligence gives you the ability to answer your own questions in minutes.

You can work with all kinds of data, from Hadoop to data warehouses to spreadsheets, and across disparate data sets.

  • Your entire organization is served, from executives to analysts, across departments and geographic locations, in the office or on-the-go.

  • It fits in seamlessly as an extension of your IT infrastructure.

  • You can take advantage of the new generation of user-friendly, visual interfaces to spot outliers and trends in complex data.

  • You, and everyone around you, are self-reliant. When it comes to getting answers from data, you don’t have to wait for anyone or anything.

  • And it’s easy on your budget, providing a low cost of ownership and a return on your investment in days or weeks, not months or years.

Interworks.com

Tableau puts analytics in the hands of the user. By enabling individual creativity and exploration from the ground floor, businesses now have the ability to adapt and outperform the competition through intuitive data visualization and analysis.

Tableau can connect to virtually any data source, be it a corporate data warehouse, Microsoft Excel or web-based data. It gives users immediate insights by transforming their data into beautiful, interactive visualizations in a matter of seconds. What took expensive teams days or months to develop is now achieved through a user-friendly drag-and-drop interface.

Boulder Insights

  • Uncover unseen patterns in your data

  • Navigate sophisticated drag and drop dashboards with ease

  • Analyze millions of rows of data ad hoc and in moments

  • Inform smart business decisions

  • Combine multiple data sources for quick analysis

  • View real-time refreshed dashboards

  • Share beautiful and clear dashboards with colleagues

Syntelli Solutions

Tableau’s rapid-fire business intelligence software lets everyone in your organization analyze and understand their data far faster than any other solution and at a fraction of their costs and resources.

  • Ease of Use:

Tableau Desktop lets you create rich visualizations and dashboards with an intuitive, drag-and-drop interface that lets you see every change as you make it. Anyone comfortable with Excel can get up to speed on Tableau quickly.

  • Direct Connect and Go

There is minimal set-up required with Tableau. In minutes you’ll be consolidating numbers and visualizing results without any advance set-up, freeing up your IT resources and enabling you to arrive at results quickly.

  • Perfect Mashups

Connect to data in one click and layer in a second data source with another. Combining data sources in the same view is so easy it feels like cheating.

  • Best Practices in a Box

Tableau has best practices built right in. You get the benefit of years of research on the best way to represent data, from color schemes that convey meaning to an elegant design that keeps users focused on what’s important.


marquis leadership

Tableau lets you use more than one data series to spotlight high and low performers in a single view. The bar chart below shows Sales by Customers, sorted in order of highest sales. I dropped the Profit data onto the Color mark, and, voila! I see not just customers who have the most sales, but also those with highest profits – just by looking at the gradation of colors from red to green. Suddenly it’s easy to see that some of the customers with lower sales are more profitable than customers with higher sales.

TEG Analytics

Tableau is the leading data visualization tool in the Gartner Magic Quadrant, and nobody knows Tableau better than us…

  • Business performance evaluations and performance-driver analysis dashboards for a Fortune 500 US CPG company

  • Pricing elasticity analytics dashboard and simulator for one of the world’s leading shipping and logistics service companies

  • Inventory management and built-in alert system dashboards for a world-leading retail company

  • Bug management and planning dashboards for a large IT product development company

  • Campaign management and promotions analysis reporting for a large digital and direct mail agency

DataSelf

Tableau’s most obvious features include its easy, intuitive user interface. Open it and go, anytime and anywhere. Use it on your desktop, on the Web, or on your mobile device. Tableau showed the world what self-service BI and data discovery should be, and it keeps leading the way.

That leadership has been acknowledged year after year by the industry’s touchstone, the Gartner Magic Quadrant, which has rated Tableau the most powerful, most intuitive analysis tool on the market for the last four years in a row.

But you don’t see the best benefit until you use it: Users stay in the flow from question to answer to more questions and more answers, all the way to insight. They don’t stop to fiddle with data, they just stay with the analysis.

datatonic

Tableau Software helps people see and understand data. Used by more than 19,000 companies and organizations worldwide, Tableau’s award-winning software delivers fast analytics and rapid-fire business intelligence. Create visualizations and dashboards in minutes, then share in seconds. The result? You get answers from data quickly, with no programming required. Data is everywhere, but most of us struggle to make sense of it. Tableau Software lets anyone visualize data and then share it on the web, no programming needed. It’s wicked-fast, easy analytics.


Originally posted on Data Science Central

Read more…

An Introduction to Data Visualization

Guest blog post by Divya Parmar

After data science, which I discussed in an earlier post, data visualization is one of the most common buzzwords thrown around in the tech and business communities. To demonstrate how one can actually visualize data, I want to use one of the hottest tools in the market right now: Tableau. You can download Tableau Public for free here, and the “Cat vs. Dog” dataset can be found here. Let’s get started.

1. Play around with the data and find what looks interesting.

After opening Tableau Public and importing my Excel file, I looked over my dataset. I was curious to see if there was a relationship between the rate of cat ownership and dog ownership. So I put dog ownership on the x-axis and cat ownership on the y-axis; I then added state name as a label. All of this is done through simply dragging and dropping, and below is a snapshot of how intuitive it is.

2. Add some elements as necessary to show your insight.

There are many ways to build on the preliminary step. You can add something like a trend line to demonstrate a statistical relationship (note that there is a p-value with the trend line), which is done through the "Analysis" tab and adds more credibility. You can even give different colors or sizes to different data points, as I have done below using the number of pet households by state to emphasize the larger states.

3. Fix and improve to make usable for export, presentation, or other purpose.

Data visualization is only useful if it is simple and to the point. In the above example, the District of Columbia data point is an outlier that is making the rest of the graph harder to read. You can edit your axis to not show D.C., and can also remove the confidence bands for the trend line to remove unessential information. 

After your visualization is ready, put it to use by sharing, embedding, or whatever means works for you.  Data visualization is easier than you think, and I encourage you to get started.

About: Divya Parmar is a recent college graduate working in IT consulting. For more posts every week, and to subscribe to his blog, please click here.

Read more…

In order to use Big Data effectively, a holistic approach is required. Organizations are now using data analytics at every level, and roles that previously would have had no need to concern themselves with data are now required to have some degree of understanding in order to leverage insights.

Ensuring that data is presented in such a way as to be understood and utilized by all employees is, however, a challenge. Most Big Data actually yields neither meaning nor value, and the sheer volume coming into businesses can be overwhelming. Companies are therefore increasingly moving away from simple 2D Excel charts, and replacing or supplementing them with powerful data visualization tools.

Sophisticated data visualization is a tool that supports analytic reasoning. It accommodates the large numbers of data points provided by Big Data using additional dimensions, colors, animations, and the digital equivalents of familiar items such as dials and gauges. The user can look at the graphics provided to reveal entanglements that may otherwise be missed, and patterns in the data can be displayed to the user at great speeds. Nicholas Marko, Chief Data Officer at Geisinger Health System, notes that: ‘You need to represent a lot of information data points in a small space without losing the granularity of what you're trying to reflect. That's why you're seeing different kinds of graphics, infographics, and more art-like or cartoon-like visualizations. It's certainly better than increasing the density of dots on a graph. At a certain point, you're not getting anything more meaningful.’

There are numerous benefits to data visualization. It reduces dependence on IT, allowing them to focus on adding value and optimizing processes. Increasing the speed at which data is analyzed also means that it can be acted upon more quickly. Intel’s IT Manager Survey found that IT managers expect 63% of all analytics to be done in real time, and those who cannot understand and act on data in this way face losing their competitive advantage to others.

Data visualization for mobile, in particular, is becoming increasingly important. The requirement for real-time analytics, the ubiquity of mobile devices, and the need to have information in real time mean that many data visualization vendors are now either adapting desktop experiences to mobile formats, or taking a mobile-first approach to developing their technology. There are obvious constraints, with smartphone display space limited. Information needs to be presented more simply than on a desktop, so rather than just translating a complex desktop view into a simpler mobile one, it is important to consider context. Designers are also exploiting gesture-based input to help users easily navigate different views and interact with the data.

Gartner estimated a 30% compound annual growth rate in the use of data analytics through 2015. Visualization data discovery tools offer a tremendous opportunity to manage and leverage the growing volume, variety, and velocity of new and existing data. Using the faster, deeper insights afforded, companies are more agile, and have a significant competitive advantage moving forward. 

Originally posted on Data Science Central

Read more…


Title: How Flextronics uses DataViz and Analytics to Improve Customer Satisfaction
Date: Tuesday, December 15, 2015
Time: 09:00 AM Pacific Standard Time
Duration: 1 hour

Space is limited.
Reserve your Webinar seat now

Flexibility in adapting to changing global markets and customer needs is necessary to stay competitive, and the Flextronics analytics team is tasked with making sure the Flex management team has accurate and up-to-date analytics to optimize their business’s performance, efficiency, and customer service.

In our latest DSC webinar series, Joel Woods from Flextronics’ Global Services and Solutions will share success stories around analytics for the repairs and refurbishment of customer products utilizing analytics and data visualization from Tableau and Alteryx.

You will learn how to:

  • Use data analytics to improve cost savings
  • Resolve common data challenges such as blending disparate data sources
  • Deliver automated and on-demand reporting to clients
  • Provide visualizations that show the analytics that matter to both internal teams and customers

About Flextronics:

Flextronics is an industry-leading end-to-end supply chain solutions company with $26 billion in sales, generated from helping customers design, build, ship, and service their products through an unparalleled network of facilities in approximately 30 countries and across four continents.

Speakers: 

Ross Perez, Sr. Product Manager -- Tableau
Joel Woods, Advanced Analytics Lead -- Flex Inc.
Maimoona Block, Alliance Manager -- Alteryx

Hosted by Bill Vorhies, Editorial Director -- Data Science Central


Again, space is limited, so please register early:
Reserve your Webinar seat now

After registering you will receive a confirmation email containing information about joining the Webinar.
Read more…

Smart Big Data: The All-Important 90/10 Rule

Guest blog post by Bernard Marr

The sheer volumes involved with Big Data can sometimes be staggering. So if you want to get value from the time and money you put into a data analysis project, a structured and strategic approach is very important.

The phenomenon of Big Data is giving us an ever-growing volume and variety of data which we can now store and analyze. Any regular reader of my posts knows that I personally prefer to focus on Smart Data, rather than Big Data - because the term places too much importance on the size of the data. The real potential for revolutionary change comes from the ability to manipulate, analyze and interpret new data types in ever-more sophisticated ways.

Application of the Pareto distribution and 90/10 rule in a related context

The SMART Data Framework

I’ve written previously about my SMART Data framework which outlines a step-by-step approach to delivering data-driven insights and improved business performance.

  1. Start with strategy: Formulate a plan – based on the needs of your business
  2. Measure metrics and data: Collect and store the information you need
  3. Apply analytics: Interrogate the data for insights and build models to test theories
  4. Report results: Present the findings of your analysis in a way that the people who will put them into effect will understand
  5. Transform your business: Understand your customers better, optimize business processes, improve staff wellbeing, or increase revenues and profits.

My work involves helping businesses use data to drive business value. Because of this I get to see a lot of half-finished data projects, mothballed when it was decided that external help was needed.

The biggest mistake by far is putting insufficient thought – or neglecting to put any thought – into a structured strategic approach to big data projects. Instead of starting with strategy, too many companies start with the data. They start frantically measuring and recording everything they can in the belief that big data is all about size. Then they get lost in the colossal mishmash of everything they’ve collected, with little idea of how to go about mining the all-important insights.

This is why I have come up with the 90/10 rule – When working with data, 90% of your time should be spent on a structured strategic approach, while 10% of your time should be spent “exploring” the data.

The 90/10 Rule

The 90% structured time should be used putting the steps outlined in the SMART Data framework into operation: making a logical progression through an ordered set of steps with a defined beginning (a problem you need to solve), a middle (a process) and an ending (answers or results).

This is, after all, why we call it Data Science. Business data projects are very much like scientific experiments, where we run simulations testing the validity of theories and hypotheses, to produce quantifiable results.

The other 10% of your time can be spent freely playing with your data – mining for patterns and insights which, while they may be valuable in other ways, are not an integral part of your SMART Data strategy.

Yes, you can be really lucky and your data exploration can deliver valuable insights – and who knows what you might find, or what inspiration may come to you? But it should always play second-fiddle to following the structure of your data project in a methodical and comprehensive way.

Always start with strategy

I think this is a very important point to make, because it’s something I often see companies get the wrong way round. Too often, the data is taken as the starting point, rather than the strategy.

Businesses that do this run the very real risk of becoming “data rich and insight poor”. They are in danger of missing out on the hugely exciting benefits that a properly implemented and structured data-driven initiative can bring.

Working in a structured way means “Starting with strategy”, which means identifying a clear business need and what data you will need to solve it. Businesses that do this, and follow it through in a methodical way will win the race to unearth the most valuable and game-changing insights.

I hope you found this post interesting. I am always keen to hear your views on the topic and invite you to comment with any thoughts you might have.

About: Bernard Marr is a globally recognized expert in analytics and big data. He helps companies manage, measure, analyze and improve performance using data.

His new book is Big Data: Using Smart Big Data, Analytics and Metrics To Make Better Decisions and Improve Performance. You can read a free sample chapter here.


Read more…

 

Guest blog by Greg Roberts at Packt Publishing

(to see this graph in its fully interactive form see http://gregroberts.github.io/)

I love Python, and to celebrate Packt’s Python Week, I’ve spent some time developing an app using some of my favourite tools. The app is a graph visualisation of Python and related topics, as well as showing where all our content fits in. The topics are all StackOverflow tags, related by their co-occurrence in questions on the site.

The app is available to view at http://gregroberts.github.io/ and in this blog, I’m going to discuss some of the techniques I used to construct the underlying dataset, and how I turned it into an online application using some of my favourite tools.

Graphs, not charts

Graphs are an incredibly powerful tool for analysing and visualising complex data. In recent years, many different graph database engines have been developed to make use of this novel manner of representing data. These databases offer many benefits over traditional, relational databases because of how the data is stored and accessed.

Here at Packt, I use a Neo4j graph to store and analyse data about our business. Using the Cypher query language, it’s easy to express complicated relations between different nodes succinctly.

It’s not just the technical aspect of graphs which make them appealing to work with. Seeing the connections between bits of data visualised explicitly as in a graph helps you to see the data in a different light, and make connections that you might not have spotted otherwise. This graph has many uses at Packt, from customer segmentation to product recommendations. In the next section, I describe the process I use to generate recommendations from the database.

Make the connection

For product recommendations, I use what’s known as a hybrid filter. This considers both content based filtering (product x and y are about the same topic) and collaborative filtering (people who bought x also bought y). Each of these methods has strengths and weaknesses, so combining them into one algorithm provides a more accurate signal.

 

The collaborative aspect is straightforward to implement in Cypher. For a particular product, we want to find out which other product is most frequently bought alongside it. We have all our products and customers stored as nodes, and purchases are stored as edges. Thus, the Cypher query we want looks like this:

 

MATCH (n:Product {title:'Learning Cypher'})-[r:purchased*2]-(m:Product)
WITH m.title AS suggestion, count(distinct r)/(n.purchased+m.purchased) AS alsoBought
WHERE m<>n
RETURN * ORDER BY alsoBought DESC

 

and will very efficiently return the most commonly also purchased product. When calculating the weight, we divide by the total units sold of both titles, so we get a proportion returned. We do this so we don’t just get the titles with the most units; we’re effectively calculating the size of the intersection of the two titles’ audiences relative to their overall audience size.

 

The content side of the algorithm looks very similar:

 

MATCH (n:Product {title:'Learning Cypher'})-[r:is_about*2]-(m:Product)
WITH m.title AS suggestion, count(distinct r)/(length(n.topics)+length(m.topics)) AS alsoAbout
WHERE m<>n
RETURN * ORDER BY alsoAbout DESC

 

Implicit in this algorithm is knowledge that a title is_about a topic of some kind. This could be done manually, but where’s the fun in that?

In Packt’s domain there already exists a huge, well moderated corpus of technology concepts and their usage: StackOverflow. The tagging system on StackOverflow not only tells us about all the topics developers across the world are using, it also tells us how those topics are related, by looking at the co-occurrence of tags in questions. So in our  graph, StackOverflow tags are nodes in their own right, which represent topics. These nodes are connected via edges, which are weighted to reflect their co-occurrence on StackOverflow:

 

edge_weight(n,m) = (# of questions tagged with both n & m)/(# questions tagged with n or m)

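As a hedged illustration (not the author's actual pipeline), this weighting can be computed in Python from a list of per-question tag sets; the toy data below is made up:

from itertools import combinations
from collections import Counter

# questions: one set of tags per StackOverflow question (toy example)
questions = [{"python", "matplotlib"}, {"python", "numpy"}, {"python", "matplotlib", "pandas"}]

tag_count = Counter()
pair_count = Counter()
for tags in questions:
    tag_count.update(tags)
    pair_count.update(combinations(sorted(tags), 2))

def edge_weight(n, m):
    both = pair_count[tuple(sorted((n, m)))]      # questions tagged with both n and m
    either = tag_count[n] + tag_count[m] - both   # questions tagged with n or m
    return both / float(either) if either else 0.0

print(edge_weight("python", "matplotlib"))  # 2/3 in this toy example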

So, to find topics related to a given topic, we could execute a query like this:


MATCH (n:StackOverflowTag {name:'Matplotlib'})-[r:related_to]-(m:StackOverflowTag)
RETURN n.name, r.weight, m.name ORDER BY r.weight DESC LIMIT 10

 

Which would return the following:

 

   | n.name     | r.weight | m.name
---+------------+----------+--------------------
 1 | Matplotlib | 0.065699 | Plot
 2 | Matplotlib | 0.045678 | Numpy
 3 | Matplotlib | 0.029667 | Pandas
 4 | Matplotlib | 0.023623 | Python
 5 | Matplotlib | 0.023051 | Scipy
 6 | Matplotlib | 0.017413 | Histogram
 7 | Matplotlib | 0.015618 | Ipython
 8 | Matplotlib | 0.013761 | Matplotlib Basemap
 9 | Matplotlib | 0.013207 | Python 2.7
10 | Matplotlib | 0.012982 | Legend

 

There are many, more complex relationships you can define between topics like this, too. You can infer directionality in the relationship by looking at the local network, or you could start constructing Hyper graphs using the extensive StackExchange API.

 

So we have our topics, but we still need to connect our content to topics. To do this, I’ve used a two stage process.

Step 1 – Parsing out the topics

We take all the copy (words) pertaining to a particular product as a document representing that product. This includes the title, chapter headings, and all the copy on the website. We use this because it’s already been optimised for search, and should thus carry a fair representation of what the title is about. We then parse this document and keep all the words which match the topics we’ve previously imported.

 

#...code for fetching all the copy for all the products
key_re = '\W(%s)\W' % '|'.join(re.escape(i) for i in topic_keywords)
for i in documents:
    tags = re.findall(key_re, i['copy'])
    i['tags'] = map(lambda x: tag_lookup[x], tags)

 

Having done this for each product, we have a bag of words representing each product, where each word is a recognised topic.

Step 2 – Finding the information

From each of these documents, we want to know the topics which are most important for that document. To do this, we use the tf-idf algorithm. Tf-idf stands for term frequency, inverse document frequency. The algorithm takes the number of times a term appears in a particular document, and divides it by the proportion of the documents that word appears in. The term frequency factor boosts terms which appear often in a document, whilst the inverse document frequency factor gets rid of terms which are overly common across the entire corpus (for example, the term ‘programming’ is common in our product copy, and whilst most of the documents ARE about programming, this doesn’t provide much discriminating information about each document).

 

To do all of this, I use python (obviously) and the excellent scikit-learn library. Tf-idf is implemented in the class sklearn.feature_extraction.text.TfidfVectorizer. This class has lots of options you can fiddle with to get more informative results.

 

import sklearn.feature_extraction.text as skt

tagger = skt.TfidfVectorizer(input = 'content',
                             encoding = 'utf-8',
                             decode_error = 'replace',
                             strip_accents = None,
                             analyzer = lambda x: x,
                             ngram_range = (1,1),
                             max_df = 0.8,
                             min_df = 0.0,
                             norm = 'l2',
                             sublinear_tf = False)

 

It’s a good idea to use the min_df & max_df arguments of the constructor so as to cut out the most common/obtuse words, to get a more informative weighting. The ‘analyzer’ argument tells it how to get the words from each document; in our case, the documents are already lists of normalised words, so we don’t need anything additional done.

 

#create vectors of all the documents
vectors = tagger.fit_transform(map(lambda x: x['tags'], rows)).toarray()
#get back the topic names to map to the graph
t_map = tagger.get_feature_names()
jobs = []
for ind, vec in enumerate(vectors):
    features = filter(lambda x: x[1] > 0, zip(t_map, vec))
    doc = documents[ind]
    for topic, weight in features:
        job = '''MERGE (n:StackOverflowTag {name:'%s'})
            MERGE (m:Product {id:'%s'})
            CREATE UNIQUE (m)-[:is_about {source:'tf_idf',weight:%f}]-(n)
            ''' % (topic, doc['id'], weight)
        jobs.append(job)

 

We then execute all of the jobs using Py2neo’s Batch functionality.

 

Having done all of this, we can now relate products to each other in terms of what topics they have in common:

 

MATCH (n:Product {isbn10:'1783988363'})-[r:is_about]-(a)-[q:is_about]-(m:Product {isbn10:'1783289007'})
WITH a.name as topic, r.weight+q.weight AS weight
RETURN topic
ORDER BY weight DESC limit 6

 

 

Which returns:

 

   | topic
---+------------------
 1 | Machine Learning
 2 | Image
 3 | Models
 4 | Algorithm
 5 | Data
 6 | Python

 

Huzzah! I now have a graph into which I can throw any piece of content about programming or software, and it will fit nicely into the network of topics we’ve developed.

Take a breath

 

So, that’s how the graph came to be. To communicate with Neo4j from Python, I use the excellent py2neo module, developed by Nigel Small. This module has all sorts of handy abstractions to allow you to work with nodes and edges as native Python objects, and then update your Neo instance with any changes you’ve made.

The graph I’ve spoken about is used for many purposes across the business, and has grown in size and scope significantly over the last year. For this project, I’ve taken from this graph everything relevant to Python.

I started by getting all of our content which is_about Python, or which is about a topic related to Python:

 

titles = [i.n for i in graph.cypher.execute('''MATCH (n)-[r:is_about]-(m:StackOverflowTag {name:'Python'}) return distinct n''')]
t2 = [i.n for i in graph.cypher.execute('''MATCH (n)-[r:is_about]-(t:StackOverflowTag)-[:related_to]-(m:StackOverflowTag {name:'Python'}) where has(n.name) return distinct n''')]
titles.extend(t2)

 

then hydrated this further by going one or two hops down each path in various directions, to get a large set of topics and content related to Python.

 

Visualising the graph

 

Since I started working with graphs, two visualisation tools I’ve always used are Gephi and Sigma.js.

 

Gephi is a great solution for analysing and exploring graphical data, allowing you to apply a plethora of different layout options, find out more about the statistics of the network, and to filter and change how the graph is displayed.

 

Sigma.js is a lightweight JavaScript library which allows you to publish beautiful graph visualisations in a browser, and it copes very well with even very large graphs.

 

Gephi has a great plugin which allows you to export your graph straight into a web page which you can host, share and adapt.

 

More recently, Linkurious have made it their mission to bring graph visualisation to the masses. I highly advise trying the demo of their product. It really shows how much value it’s possible to get out of graph based data. Imagine if your Customer Relations team were able to do a single query to view the entire history of a case or customer, laid out as a beautiful graph, full of glyphs and annotations.

Linkurious have built their product on top of Sigma.js, and they’ve made available much of the work they’ve done as the open source Linkurious.js. This is essentially Sigma.js, with a few changes to the API, and an even greater variety of plugins. On Github, each plugin has an API page in the wiki and a downloadable demo. It’s worth cloning the repository just to see the things it’s capable of!

 

Publish It!

So here’s the workflow I used to get the Python topic graph out of Neo4j and onto the web.

 

-Use Py2neo to extract the subgraph of content and topics pertinent to Python, as described above

 

-Add to this some other topics linked to the same books to give a fuller picture of the Python “world”

 

-Add in topic-topic edges and product-product edges to show the full breadth of connections observed in the data

 

-Export all the nodes and edges to CSV files

 

-Import node and edge tables into Gephi.

 

The reason I’m using Gephi as a middle step is so that I can fiddle with the visualisation in Gephi until it looks perfect. The layout plugin in Sigma is good, but this way the graph is presentable as soon as the page loads, the communities are much clearer, and I’m not putting undue strain on browsers across the world!

 

-The layout of the graph has been achieved using a number of plugins. Instead of using the pre-installed ForceAtlas layouts, I’ve used the OpenOrd layout, which I feel really shows off the communities of a large graph. There’s a really interesting and technical presentation about how this layout works here.

 

-Export the graph into gexf format, having applied some partition and ranking functions to make it more clear and appealing.

 

Now it’s all down to Linkurious and its various plugins! You can explore the source code of the final page to see all the details, but here I’ll give an overview of the different plugins I’ve used for the different parts of the visualisation:

First instantiate the graph object, pointing to a container (note the CSS of the container; without this, the graph won’t display properly):

 

<style type="text/css">
  #container {
    max-width: 1500px;
    height: 850px;
    margin: auto;
    background-color: #E5E5E5;
  }
</style>
<div id="container"></div>
<script>
s = new sigma({
    container: 'container',
    renderer: {
        container: document.getElementById('container'),
        type: 'canvas'
    },
    settings: {
        …
    }
});

 

sigma.parsers.gexf - used for (trivially!) importing a gexf file into a sigma instance

 

sigma.parsers.gexf(
    'static/data/Graph1.gexf',
    s,
    function(s) {
        //callback executed once the data is loaded, use this to set up any aspects of the app which depend on the data
    });

 

-sigma.plugins.filter - Adds the ability to very simply hide nodes/edges based on a callback function. This powers the filtering widget on the page.

 

<input class="form-control" id="min-degree" type="range" min="0" max="0" value="0">

function applyMinDegreeFilter(e) {
    var v = e.target.value;
    $('#min-degree-val').text(v);
    filter
        .undo('min-degree')
        .nodesBy(
            function(n, options) {
                return this.graph.degree(n.id) >= options.minDegreeVal;
            }, {
                minDegreeVal: +v
            },
            'min-degree'
        )
        .apply();
};
$('#min-degree').change(applyMinDegreeFilter);

 

-sigma.plugins.locate - Adds the ability to zoom in on a single node or collection of nodes. Very useful if you’re filtering a very large initial graph.

 

function locateNode (nid) {
    if (nid == '') {
        locate.center(1);
    }
    else {
        locate.nodes(nid);
    }
};

 

-sigma.renderers.glyphs - Allows you to add custom glyphs to each node. Useful if you have many types of node.

 

Outro

This application has been a very fun little project to build. The improvements to Sigma wrought by Linkurious have resulted in an incredibly powerful toolkit to rapidly generate graph based applications with a great degree of flexibility and interaction potential.

 

None of this would have been possible were it not for Python. Python is my right (left, I’m left-handed) hand, which I use for almost everything. Its versatility and expressiveness make it an incredibly robust Swiss army knife in any data analyst’s toolkit.

Do more with Python! This week Packt is celebrating Python with a 50% discount on their leading titles. Take a look at what’s on offer and expand your programming horizons today.

Author Bio:

Greg Roberts is a Data Analyst at Packt Publishing, and has a Masters degree in Theoretical Physics and Applied Maths. Since joining Packt he has developed a passion for working with data, and enjoys learning about new or novel methods of predictive analysis and machine learning. To this end, he spends most of his time fiddling with models in python, and creating ever more complex Neo4j instances, for storing and analysing any and all data he can find. When not writing Python, he is a big fan of reddit, cycling and making interesting noises with a guitar.

You can find Greg on Twitter @GregData

Originally posted on Data Science Central 

Read more…

What Defines a Big Data Scenario?

Guest blog post by Khosrow Hassibi

Big data is a new marketing term that highlights the ever-increasing and exponential growth of data in every aspect of our lives. The term big data originated from within the open-source community, where there was an effort to develop analytics processes that were faster and more scalable than traditional data warehousing, and could extract value from the vast amounts of unstructured and semi-structured data produced daily by web users. Consequently, big data origins are tied to web data, though today big data is used in a larger context. Read more here...

Infographic source: "Data Never Sleeps 2.0"

Read more…

Environmental Monitoring using Big Data

Guest blog post by Heinrich von Keler

In this post, I will cover in-depth a Big Data use case: monitoring and forecasting air pollution.

A typical Big Data use case in the modern enterprise includes collecting and storing sensor data, executing data analytics at scale, generating forecasts, creating visualization portals, and automatically raising alerts in the case of abnormal deviations or threshold breaches.

This article will focus on an implemented use case: monitoring and analyzing air quality sensor data using Axibase Time-Series Database and R Language.

Steps taken by the data science team to execute the use case:

  • Collect historical data from AirNow into ATSD
  • Stream current data from AirNow into ATSD
  • Use R Language to execute data analytics and generate forecasts for all collected entities and metrics
  • Create Holt-Winters forecasts in ATSD for all collected entities and metrics
  • Build a visualization portal
  • Setup alert and notification rules in the ATSD Rule Engine

The Data

Hourly readings of several key air quality metrics are being generated by over 2,000 monitoring sensor stations located in over 300 cities across the United States; the historical and streaming data is retrieved and stored in ATSD.

The data is provided by AirNow, which is a U.S. government EPA program that protects public health by providing forecast and real-time air quality information.

The two main collected metrics are PM2.5 and Ozone (o3).

PM2.5 is particles less than 2.5 micrometers in diameter, often called “fine” particles. These particles are so small they can be detected only with an electron microscope. Sources of fine particles include all types of combustion, including motor vehicles, power plants, residential wood burning, forest fires, agricultural burning, and industrial processes.

o3 (Ozone) occurs naturally in the Earth’s upper atmosphere, 6 to 30 miles above the Earth’s surface where it forms a protective layer that shields us from the sun’s harmful ultraviolet rays. Man-made chemicals are known to destroy this beneficial ozone.

Other collected metrics are: pm10 (particulate matter up to 10 micrometers in size), co (Carbon Monoxide), no2 (nitrogen dioxide) and so2 (sulfur dioxide).

Collecting/Streaming the Data

A total of 5 years of historical data has been collected, stored, analyzed and accurately forecast. In order for the forecasts to have maximum accuracy and to account for trends and seasonal cycles, at least 3 to 5 years of detailed historical data is recommended.

An issue with the accuracy of the data was immediately identified. The data was becoming available with a fluctuating time delay of 1 to 3 hours. An analysis was conducted by collecting all values for each metric and entity, resulting in several data points being recorded for the same metric, entity and time. This led us to believe that there was both a time delay and a stabilization period. The results were as follows:

Once available, the data then took another 3 to 12 hours to stabilize, meaning that the values were fluctuating during that time frame for most data points.

As a result of this analysis, it was decided that all data would be collected with a 12-hour delay in order to increase the accuracy of the data and forecasts.
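The original analysis code is not shown, but the idea can be sketched in Python with pandas (the file and column names below are assumptions, not details from the post): for every observation timestamp, compare the successive revisions that were collected and measure how long it took for the value to stop changing.

import pandas as pd

# Assumed columns: entity, metric, obs_time (the hour the reading refers to),
# collected_at (when the value was downloaded), value
df = pd.read_csv("airnow_revisions.csv", parse_dates=["obs_time", "collected_at"])

def stabilization_lag(group):
    """Hours from the observation time until the value last changed."""
    group = group.sort_values("collected_at")
    changed = group["value"].ne(group["value"].shift())
    last_change = group.loc[changed, "collected_at"].max()
    return (last_change - group["obs_time"].iloc[0]).total_seconds() / 3600.0

lags = df.groupby(["entity", "metric", "obs_time"]).apply(stabilization_lag)
print(lags.describe())  # distribution of stabilization delays, in hours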

Axibase Collector was used to collect the data from monitoring sensor stations and stream into Axibase Time-Series Database.

In Axibase Collector a job was setup to collect data from the air monitoring sensor stations in Fresno, California. For this particular example, Fresno was selected because it is considered one of the most polluted cities in the United States, with air quality warnings being often issued to the public.

The File Job sets up a cron task that runs at a specified interval to collect the data and batch upload it into ATSD.

The File Forwarding Configuration is a parser configuration for data incoming from an external source. The path to the external data source is specified, a default entity is assigned to the Fresno monitoring sensor station, start time and end time determine the time frame for retrieving new data (end time syntax is used).

Once these two configurations are saved, the collector starts streaming fresh data into ATSD.

The entities and metrics streamed by the collector into ATSD can be viewed from the UI.

The whole data-set currently has over 87,000,000 records for each metric, all stored in ATSD.

Generating Forecasts in R

The next step was to analyze the data and generate accurate forecasts. The built-in Holt-Winters and ARIMA algorithms were used in ATSD, and custom R language forecasting algorithms were used for comparison.

To analyze the data in R, the R language API client was used to retrieve the data and then save the custom forecasts back into ATSD.

Forecasts were built for all metrics for the period of May 11 until June 1.

The steps taken to forecast the pm2.5 metric will be highlighted.

The Rssa package was used to generate the forecast. This package implements the Singular Spectrum Analysis (SSA) method.

Recommendations from the following sources were used to choose parameters for SSA forecasting:

The following steps were executed when building the forecasts:

The pm2.5 series was retrieved from ATSD using the query() function; 72 days of data were loaded.
An SSA decomposition was built with a window of 24 days and 100 eigen triples:

dec <- ssa(values, L = 24 * 24, neig = 100)


Eigen values, eigen vectors, pairs of sequential eigen vectors and the w-correlation matrix of the decomposition were graphed:

plot(dec, type = "values")
plot(dec, type = "vectors", idx = 1:20)
plot(dec, type = "paired", idx = 1:20)
plot(wcor(dec), idx = 1:100)

A group of eigen triples was then selected to use when forecasting. The plots suggest several options.

Three different options were tested: 1, 1:23, and 1:35, because groups 1, 2:23 and 24:35 are separated from other eigen vectors, as judged from the w-correlation matrix.

The rforecast() function was used to build the forecast:

rforecast(x = dec, groups = 1:35, len = 21 * 24, base = "original")


Tests were run with vforecast() and bforecast() using different parameters, but rforecast() was determined to be the best option in this case.

Graph of the original series and three resulting forecasts:

Forecast with eigen triples 1:35 was selected as the most accurate and saved into ATSD.

  • To save forecasts into ATSD the save_series() function was used.

Generating Forecasts in ATSD

The next step was to create a competing forecast in ATSD using the built-in forecasting features. The majority of the settings were left in automatic mode, so the system itself determines the best parameters (based on the historical data) when generating the forecast.
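ATSD's forecasting internals are not exposed here, but as a rough, hedged illustration of what a Holt-Winters (triple exponential smoothing) forecast on hourly pm2.5 data looks like in Python, the sketch below uses statsmodels; the file name, column name and daily seasonal period are assumptions rather than details from the original post.

import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# pm25: hourly pm2.5 readings indexed by timestamp (hypothetical CSV layout)
pm25 = pd.read_csv("fresno_pm25.csv", index_col=0, parse_dates=True)["value"]

model = ExponentialSmoothing(
    pm25,
    trend="add",          # additive trend component
    seasonal="add",       # additive daily seasonality
    seasonal_periods=24,  # 24 hourly readings per seasonal cycle
)
fit = model.fit()
forecast = fit.forecast(steps=21 * 24)  # three weeks ahead, matching the use case
print(forecast.head())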

Visualizing the Results

To visualize the data and forecasts, a portal was created using the built-in visualization features.

Thresholds have been set for each metric, in order to alert the user when either the forecast or actual data are reaching unhealthy levels of air pollution.

When comparing the R forecasts and ATSD forecasts to the actual data, the ATSD forecasts turned out to be significantly more accurate in most cases, learning and recognizing the patterns and trends with more certainty. Up to this point in time, as the actual data comes in, it has been following the ATSD forecast very closely; any deviations are minimal and fall within the confidence interval.

It is clear that the built-in forecasting of ATSD often produces more accurate results than even one of the most advanced R language forecasting algorithms used in this use case. It is absolutely possible to rely on ATSD to forecast air pollution for a few days or weeks into the future.

You can keep track of how these forecasts perform in comparison to the actual data in Chart Lab.

Alerts and Notifications

A smart alert notification was set up in the Rule Engine to notify the user by email if the pollution levels breach the set threshold or deviate from the ATSD forecast.

Analytical rules set in the Rule Engine for the pm2.5 metric – alerts will be raised if the streaming data satisfies one of the rules:

value > 30 - Raise an alert if the last metric value exceeds the threshold.

forecast_deviation(avg()) > 2 - Raise an alert if the actual value exceeds the forecast by more than 2 standard deviations, see image below. Smart rules like this capture extreme spikes in air pollution.

At this point the use case is fully implemented and will function autonomously; ATSD automatically streams the sensor data, generates a new forecast every 24 hours for 3 weeks into the future and raises alerts if the pollution levels rise above the threshold or if a negative trend is discovered.

Results and Conclusions

The results of this use case are useful for travelers, for whom it is important to have an accurate forecast of environmental and pollution-related issues that they may face during their visits, and for expats moving to work in a new city or country. Studies have proven that long-term exposure to high levels of pm2.5 can lead to serious health issues.

This research and environmental forecasting is especially valuable in regions like China, where air pollution is seriously affecting the local population and visitors. In cities like Shanghai, Beijing and Guangzhou, pm2.5 levels are constantly fluctuating from unhealthy to critical levels, and yet accurate forecasting is limited. Forecasting pm2.5 levels is critical for travelers and tourists who need to plan their trips during periods of lower pollution, given the potential health risks associated with exposure to this sort of pollution.

Government agencies can also take advantage of pollution monitoring to plan and issue early warnings to travelers and locals, so that precautions can be taken to prevent exposure to unhealthy levels of pm2.5 pollution. Detecting a trend and raising an alert prior to pm2.5 levels breaching the unhealthy threshold is critical for public safety and health. Having good air quality data and performing data analytics can allow people to adapt and make informed decisions.

Big Data Analytics is an empowerment tool that can put valuable information in the hands of corporations, governments and individuals, and that knowledge can help motivate or give people tools to stimulate change. Air pollution is currently affecting the lives of over a billion people across the globe, and with current trends the situation will only get worse. Often the exact source of the air pollution, how it’s interacting in the air and how it’s dispersing cannot be determined; the lack of such information makes it a difficult problem to tackle. With advances in modern technologies and new Big Data solutions, it is becoming possible to combine sensor data with meteorological satellite data to perform extensive data analytics and forecasting. Through Big Data analytics it will be possible to pinpoint the pollution source and dispersion trends days in advance.

I sincerely believe that Big Data has a large role to play in tackling air pollution and that in the coming years advanced data analytics will be a key tool influencing government decisions and regulation change.

You can learn more about Big Data analytics, forecasting and visualization at Axibase.

Read more…

Time Series Analysis using R-Forecast package

Guest blog post by suresh kumar Gorakala

In today’s blog post, we shall look into time series analysis using the R package forecast. The objective of the post is to explain the different methods available in the forecast package that can be applied when dealing with time series analysis and forecasting.

What is Time Series?

A time series is a collection of observations of well-defined data items obtained through repeated measurements over time. For example, measuring the value of retail sales each month of the year would comprise a time series.

Objective:

  • Identify patterns in the data – stationarity/non-stationarity. 
  • Prediction from previous patterns.

Time series Analysis in R:

My data set contains monthly sales of cars from January 2008 to December 2012.

Problem Statement: Forecast sales for 2013

MyData[1,1:14]

PART   Jan08   Feb08   Mar08   ....   Nov12   Dec12
MERC    100     127      56    ....    776     557

Table: the first row of the data, from Jan 2008 to Dec 2012

Results:

The forecast of the time series is shown in the plot below.

Assuming that the data sources for the analysis are finalized and the data has been cleansed, the analysis proceeds in the steps below.

Step 1: Understand the data:

As a first step, understand the data visually. For this purpose, the data is converted to a time series object using ts() and plotted using the plot() function available in R.

ts = ts(t(data[,7:66]))                  # columns 7:66 hold the 60 monthly sales values
plot(ts[,1], type = 'o', col = 'blue')   # plot the first series

The image above shows the monthly sales of an automobile.

Forecast package & methods:

The forecast package is written by Rob J Hyndman and is available from CRAN here. The package contains methods and tools for displaying and analyzing univariate time series forecasts, including exponential smoothing via state space models and automatic ARIMA modelling.

Before going into more accurate forecasting functions for time series, let us make some basic forecasts using the meanf(), naive(), and rwf() (random walk with drift) methods. Though these may not give accurate results, we can use them as benchmarks.

All of these forecasting functions return objects which contain the original series, point forecasts, the forecasting method used, and the residuals. The code below shows the three methods and their plots.

library(forecast)

mf = meanf(ts[,1],h=12,level=c(90,95),fan=FALSE,lambda=NULL)

plot(mf)

 



mn = naive(ts[,1],h=12,level=c(90,95),fan=FALSE,lambda=NULL) 

plot(mn)

 



md = rwf(ts[,1],h=12,drift=T,level=c(90,95),fan=FALSE,lambda=NULL) 

plot(md) 
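
To see the components mentioned above, the returned objects can be inspected directly; for example, using the mf object created by meanf():

mf$method       # the forecasting method used, e.g. "Mean"
mf$mean         # the point forecasts for the next h = 12 periods
mf$x            # the original series the forecast was based on
residuals(mf)   # in-sample residuals
summary(mf)     # point forecasts together with the 90% and 95% intervals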

Measuring accuracy:

Once a model has been generated, its accuracy can be tested using accuracy(). The accuracy() function returns the MASE value, among other measures, which can be used to assess the model. The best model is the one with relatively lower values of ME, RMSE, MAE, MPE, MAPE and MASE in the results below.

> accuracy(md)
                       ME     RMSE      MAE       MPE     MAPE     MASE
Training set 1.806244e-16 2.445734 1.889687 -41.68388 79.67588 1.197689

> accuracy(mf)
                       ME     RMSE      MAE       MPE     MAPE     MASE
Training set  1.55489e-16 1.903214 1.577778 -45.03219 72.00485        1

> accuracy(mn)
                       ME     RMSE      MAE       MPE     MAPE     MASE
Training set    0.1355932  2.44949 1.864407 -36.45951 76.98682 1.181666

Step 2: Time Series Analysis Approach:

A typical time series analysis involves the steps below:

  • Identify underlying patterns – stationarity/non-stationarity, seasonality, trend.
  • After the patterns have been identified, apply transformations to the data if needed, based on the seasonality/trends present in the data.
  • Forecast the future values with forecast(), using the appropriate ARIMA model obtained from auto.arima().

Identify Stationarity/Non-Stationarity:

A stationary time series is one whose properties do not depend on the time at which the series is observed. Time series with trends, or with seasonality, are not stationary.

The stationarity/non-stationarity of the data can be determined by applying unit root tests such as the augmented Dickey–Fuller (ADF) test and the Kwiatkowski–Phillips–Schmidt–Shin (KPSS) test.

ADF: The null-hypothesis for an ADF test is that the data are non-stationary. So large p-values are indicative of non-stationarity, and small p-values suggest stationarity. Using the usual 5% threshold, differencing is required if the p-value is greater than 0.05.

 library(tseries)

adf = adf.test(ts[,1])

adf

        Augmented Dickey-Fuller Test

data:  ts[, 1]

Dickey-Fuller = -4.8228, Lag order = 3, p-value = 0.01

alternative hypothesis: stationary

The output above suggests that the data is stationary, and we can go ahead with ARIMA models.

 

KPSS: Another popular unit root test is the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test. This reverses the hypotheses, so the null-hypothesis is that the data are stationary. In this case, small p-values (e.g., less than 0.05) suggest that differencing is required.

kpss = kpss.test(ts[,1])

Warning message:

In kpss.test(ts[, 1]) : p-value greater than printed p-value

kpss

        KPSS Test for Level Stationarity

data:  ts[, 1]

KPSS Level = 0.1399, Truncation lag parameter = 1, p-value = 0.1

Differencing:

Based on the unit root test results we identify whether the data is stationary or not. If the data is stationary, we choose an optimal ARIMA model and forecast the future intervals. If the data is non-stationary, we use differencing – computing the differences between consecutive observations. Use the ndiffs() and diff() functions to find the number of times the data needs to be differenced and to difference the data, respectively.

ndiffs(ts[,1])

[1] 1

diff_data = diff(ts[,1])

Time Series:

Start = 2

End = 60

Frequency = 1

 [1]  1  5 -3 -1 -1  0  3  1  0 -4  4 -5  0  0  1  1  0  1  0  0  2 -5  3 -2  2  1 -3  0  3  0  2 -1 -5  3 -1

[36] -1  2 -1 -1  5 -2  0  2 -2 -4  0  3  1 -1  0  0  0 -2  2 -3  4 -3  2  5

 

Now retest for stationarity by applying the acf()/kpss.test() functions (a minimal sketch follows); if the results show stationarity, go ahead and apply ARIMA models.
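
A minimal sketch of that re-check on the differenced series:

# Re-run the unit root tests (and inspect the autocorrelations) on diff_data;
# library(tseries) was already loaded above
adf.test(diff_data)    # a small p-value now indicates stationarity
kpss.test(diff_data)   # a large p-value now indicates stationarity
acf(diff_data)         # autocorrelations should die out quickly for a stationary series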

Identify Seasonality/Trend:

The seasonality in the data can be examined by plotting the stl() decomposition:

stl_fit = stl(ts[,1], s.window = "periodic")

Error in stl(ts[, 1], s.window = "periodic") :
  series is not periodic or has less than two periods

 

Since my data doesn’t contain any seasonal behavior I will not touch the Seasonality part.

ARIMA Models:

For forecasting stationary time series data we need to choose an optimal ARIMA(p,d,q) model. For this we can use the auto.arima() function, which chooses the optimal (p,d,q) values and returns the fitted model. Know more about ARIMA from here.

auto.arima(ts[,2])

Series: ts[, 2]

ARIMA(3,1,1) with drift        

Coefficients:

          ar1      ar2      ar3      ma1   drift

      -0.2621  -0.1223  -0.2324  -0.7825  0.2806

s.e.   0.2264   0.2234   0.1798   0.2333  0.1316

sigma^2 estimated as 41.64:  log likelihood=-190.85

AIC=393.7   AICc=395.31   BIC=406.16

 

Forecast time series:

 

Now we use the forecast() method to forecast the future values.

forecast(auto.arima(diff_data))
   Point Forecast     Lo 80      Hi 80     Lo 95    Hi 95
61   -3.076531531 -5.889584 -0.2634795 -7.378723 1.225660
62    0.231773625 -2.924279  3.3878266 -4.594993 5.058540
63    0.702386360 -2.453745  3.8585175 -4.124500 5.529272
64   -0.419069906 -3.599551  2.7614107 -5.283195 4.445055
65    0.025888991 -3.160496  3.2122736 -4.847266 4.899044
66    0.098565814 -3.087825  3.2849562 -4.774598 4.971729
67   -0.057038778 -3.243900  3.1298229 -4.930923 4.816846
68    0.002733053 -3.184237  3.1897028 -4.871317 4.876783
69    0.013817766 -3.173152  3.2007878 -4.860232 4.887868
70   -0.007757195 -3.194736  3.1792219 -4.881821 4.866307

plot(forecast(auto.arima(diff_data)))

The flow chart below gives a summary of the time series ARIMA modelling approach:

The flow diagram above explains the steps to be followed for time series forecasting.

Please visit my blog dataperspective for more articles

Read more…

15 Questions All R Users Have About Plots

Guest blog post by Bill Vorhies

Posted by DataCamp July 30th, 2015.

See the full blog here

R allows you to create different plot types, ranging from the basic graph types like density plots, dot plots, bar charts, line charts, pie charts, boxplots and scatter plots, to the more statistically complex types of graphs such as probability plots, mosaic plots and correlograms.

In addition, R is well known for its data visualization capabilities: it allows you to go from producing basic graphs with little customization to plotting advanced graphs with full-blown customization in combination with interactive graphics. Nevertheless, we do not always get the results we want for our R plots.

Here’s a quick list of what’s included:

1. How To Draw An Empty R Plot?

  • How To Open A New Plot Frame
  • How To Set Up The Measurements Of The Graphics Window
  • How To Draw An Actual Empty Plot

2. How To Set The Axis Labels And Title Of The R Plots?

  • How To Name Axes (With Up- Or Subscripts) And Put A Title To An R Plot?
  • How To Adjust The Appearance Of The Axes’ Labels
  • How To Remove A Plot’s Axis Labels And Annotations
  • How To Rotate A Plot’s Axis Labels
  • How To Move The Axis Labels Of Your R Plot

3. How To Add And Change The Spacing Of The Tick Marks Of Your R Plot

  • How To Change The Spacing Of The Tick Marks Of Your R Plot
  • How To Add Minor Tick Marks To An R Plot

4. How To Create Two Different X- or Y-axes

5. How To Add Or Change The R Plot’s Legend?

  • Adding And Changing An R Plot’s Legend With Basic R
  • How To Add And Change An R Plot’s Legend And Labels In ggplot2

6. How To Draw A Grid In Your R Plot?

  • Drawing A Grid In Your R Plot With Basic R
  • Drawing A Grid In An R Plot With ggplot2

7. How To Draw A Plot With A PNG As Background?

8. How To Adjust The Size Of Points In An R Plot?

  • Adjusting The Size Of Points In An R Plot With Basic R
  • Adjusting The Size Of Points In Your R Plot With ggplot2

9. How To Fit A Smooth Curve To Your R Data

10. How To Add Error Bars In An R Plot

  • Drawing Error Bars With Basic R
  • Drawing Error Bars With ggplot2
  • Error Bars Representing Standard Error Of Mean
  • Error Bars Representing Confidence Intervals
  • Error Bars Representing The Standard Deviation

11. How To Save A Plot As An Image On Disc

12. How To Plot Two R Plots Next To Each Other?

  • How To Plot Two Plots Side By Side Using Basic R
  • How To Plot Two Plots Next To Each Other Using ggplot2
  • How To Plot More Plots Side By Side Using gridExtra
  • How To Plot More Plots Side By Side Using lattice
  • Plotting Plots Next To Each Other With gridBase

13. How To Plot Multiple Lines Or Points?

  • Using Basic R To Plot Multiple Lines Or Points In The Same R Plot
  • Using ggplot2 To Plot Multiple Lines Or Points In One R Plot

14. How To Fix The Aspect Ratio For Your R Plots

  • Adjusting The Aspect Ratio With Basic R
  • Adjusting The Aspect Ratio For Your Plots With ggplot2
  • Adjusting The Aspect Ratio For Your Plots With MASS

15. What Is The Function Of hjust And vjust In ggplot2?
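
As a small taste of items 1, 2 and 5 above, here is a minimal base-R sketch (illustrative code, not taken from the original post):

x <- 1:10
y <- x^2

plot(x, y, type = "n", xlab = "", ylab = "", axes = FALSE)          # 1. an empty plot frame
axis(1); axis(2); box()                                             # draw the axes back in
points(x, y, pch = 19, col = "steelblue")
title(main = "A simple example", xlab = "x", ylab = "x squared")    # 2. axis labels and title
legend("topleft", legend = "y = x^2", pch = 19, col = "steelblue")  # 5. a legend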

Read more…

Here we ask you to identify which tool was used to produce the following 18 charts: 4 were done with R, 3 with SPSS, 5 with Excel, 2 with Tableau, 1 with Matlab, 1 with Python, 1 with SAS, and 1 with JavaScript. The solution, including for each chart a link to the webpage where it is explained in detail (many times with source code included) can be found here. You need to be a DSC member to access the page with the solution: you can sign-up here.

How do you score? Would this be a good job interview question?

Chart 1 through Chart 18 (the chart images are shown in the original post).


Read more…

Big Data In Banking [infographic]

The banking industry generates a large volume of data on a day-to-day basis. To differentiate themselves from the competition, banks are increasingly adopting big data analytics as part of their core strategy. Analytics will be a critical game changer for banks. In this infographic we explore the scale at which banks have adopted analytics in their business.

Originally posted on Data Science Central

Read more…

Making Data Visualization Work


In commercial terms, how we perceive information determines how we process, interpret and action it. Our brains are wired to process visual information, and understanding how we do this ensures we leverage Data Visualization to full effect.

The processes behind Data Analytics mirror the brain’s own functions. Initially your brain processes visual stimuli through the retina, then the thalamus, then the primary visual cortex and the association cortex. At each stage, there are filters that our brains apply to determine whether this information is relevant enough to continue processing. We call this ‘rubbish in, rubbish out.’ Because it is important for information to be understood quickly and easily, this is where Data Visualization comes into its element.

Using the Right Chart for the Right Job



Pre-attentive processing occurs within the first 200 milliseconds of seeing a visual. Colour, form and pattern are discernible during this phase. This is why spotting a red jelly bean in a bowl of white jelly beans is really easy.

Bar Charts illustrate a snapshot of the information better than line charts, allowing you to make a split-second assessment of the value of what you are seeing. Then comes the fun part: using the correct colours to create a story enables you to emphasise information through a universally acknowledged cognitive alphabet. Red: danger; Blue: all is well; Green: growth and action; Yellow: of interest. Pastel tones are more soothing on the eye, and so forth.

Because we discovered the world through colour and shape, our long term memory allows us to interpret Visual Data with split second clarity.

To get the best use out of colour, when building a Dashboard theme, follow a line of colours around the spectrum for tone on tone harmony.

Memory is the Key to Data Visualization

Psychologists put it like this: we have 3 memory components – Sensory, Working [Short Term] and Long Term. How we use them is based on a push/pull and slow/fast processing system.

Slow processing handles information in the present (what is 73 x 62?). Fast processing dips into pre-programmed paradigms and draws quick conclusions based on experienced patterns of knowledge (what is 2 x 2?).

The Sensory Register is the component of memory that holds input in its original, unencoded form. Probably everything that your body is capable of seeing and hearing is stored here. In other words, the sensory register has a large capacity; it can hold a great deal of information at one time.

Working Memory [short-term memory] is the component of memory where new information is held while it is being processed; in other words, it is a temporary “holding bin” for new information. Working memory is also where much of our thinking, or cognitive processing, occurs. It is where we try to make sense of, say, this blog, or solve a problem. Generally speaking, working memory is the component that probably does most of the heavy lifting. It has two characteristics that Data Analytics works around: a short duration span and a limited capacity.

Long-Term Memory is the Hall of Permanent Record. Long-term-memory is capable of holding as much information as an individual needs to store there; there is probably no such thing as a person “running out of room.” The more information already stored in long-term memory, the easier it is to learn new things.

Data Analytics brings together components of memory function and interconnects the relations through holding patterns. So when we process Analytics and Act on them, we essentially create a ‘new permanent record.’

This is what makes us smarter, faster and more efficient.

Creating your Optimized Business Candy with Data Visualization

When building an Analytics Dashboard, consider the Story you are trying to build and the Questions you want answered. Encourage discovery of Hidden elements and pull together Relevant information. Present clear Relationships between one data set and another, and choose the Correct Chart to illustrate your scenario. Pay particular attention to colour, objects, shapes, patterns and the amount of information.

Quick Study: Line or Bar?

  • Line Graphs: Demonstrate the continual nature of data and pattern.
  • Bar Charts: Illustrate value and variables, prominent attributes and ranking.
  • Bar Stacks: Show the values, contribution and ratio in blocks.
  • Percentages: Shift emphasis from quantity to relative differences.
  • Cumulative: Summarise all variables along a timeline.
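
In R terms, the distinction looks roughly like this (a minimal sketch with made-up figures):

sales <- c(Q1 = 12, Q2 = 18, Q3 = 15, Q4 = 22)         # made-up quarterly figures

plot(sales, type = "o", xaxt = "n", ylab = "Sales")    # line graph: continuity and pattern
axis(1, at = 1:4, labels = names(sales))

barplot(sales, ylab = "Sales")                         # bar chart: value and ranking at a glance

barplot(rbind(Online = sales * 0.4, Store = sales * 0.6),
        ylab = "Sales", legend.text = TRUE)            # bar stacks: contribution within each block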

 

AnyData works with natural brain tech to bring Data and Business together. Get the most out of Data Visualization by visiting our Learning Centre and watching the How To Videos.

Originally posted on Big Data News

Read more…
