Subscribe to our Newsletter

Featured Posts (196)

Main Trends for IT in 2016 (infographic)

Guest blog and infographic from Glorium Technologies

The Spiceworks report identifies four main IT trends for 2016.

  1. Companies’ revenue is expected to grow while IT budgets remain flat.
  2. As a consequence, CIOs will have to do more with less.
  3. Even though security concerns are increasing, security expenses are expected to stay flat.
  4. The leading driver of the 2016 IT budget is product end of life.

 

On average, each company plans to increase IT spending by at most $2,000 (comparing 2015 and 2016, worldwide). Among respondents, 42% say their budget will remain the same, 38% say it will increase, and 10% expect it to decrease. The reason for this outcome is the desire to keep costs low. IT staffing is not set to grow much either: 59% report no changes, 34% plan to increase headcount, and 4% plan to decrease it.

 

The main priorities of the 2016 IT budget are hardware expenses (37%), software expenses (31%), cloud-based projects (14%), and managed services (13%).

 

Surprisingly, laptops will not take first place in 2016; desktops will. Here is how hardware spending is expected to be distributed: 21% desktops, 19% servers, 16% laptops, 10% networking, 6% external storage, and 6% mobile devices and tablets.

 

Software expenses are divided more evenly: 15% goes to virtualization, operating systems, and productivity software; 10% to CRM, backup, and databases; and 9% to security.
The main motto behind this spending: “If it ain’t broke, don’t fix it.”

 

Here are the main reasons for companies’ IT investments: end of life, growth or additional needs, upgrades or refresh cycles, end-user need, project need, budget availability, application compatibility, and new technologies or features.

 

Responses to the question “Is your security at a decent level?” are rather surprising: 61% do not conduct security audits, 59% believe that their investments in security are not adequate, for 51% security is not a 2016 priority, and 48% regard their business processes as not adequately protected.

 

Conclusion:

 

  1. Companies’ revenue is going to increase, while IT budgets are not going to change.
  2. CIOs will have to solve strategic issues for the future with a budget from the past.
  3. Product end of life is the main driver of IT investments.
  4. Even though companies understand that they do not invest enough in security, security budgets are set to stay rather low.

 

The infographic is based on Spiceworks’ annual report on IT budgets and tech trends.

Originally posted on Data Science Central

Read more…

The Amazing Ways Uber Is Using Big Data

Guest blog post by Bernard Marr

Uber is a smartphone-app based taxi booking service which connects users who need to get somewhere with drivers willing to give them a ride. The service has been hugely controversial, due to regular taxi drivers claiming that it is destroying their livelihoods, and concerns over the lack of regulation of the company’s drivers.

Source for picture: Mapping a city’s flow using Uber data

This hasn’t stopped it from also being hugely successful – since being launched to purely serve San Francisco in 2009, the service has been expanded to many major cities on every continent except for Antarctica.

The business is rooted firmly in Big Data and leveraging this data in a more effective way than traditional taxi firms have managed has played a huge part in its success.

Uber’s entire business model is based on the very Big Data principle of crowd sourcing. Anyone with a car who is willing to help someone get to where they want to go can offer to help get them there.  

Uber holds a vast database of drivers in all of the cities it covers, so when a passenger asks for a ride, they can instantly match you with the most suitable drivers.

Fares are calculated automatically, using GPS, street data and the company’s own algorithms which make adjustments based on the time that the journey is likely to take. This is a crucial difference from regular taxi services because customers are charged for the time the journey takes, not the distance covered.

Surge pricing

These algorithms monitor traffic conditions and journey times in real time, so prices can be adjusted as demand for rides changes and as traffic conditions mean journeys are likely to take longer. This encourages more drivers to get behind the wheel when they are needed – and to stay at home when demand is low. The company has applied for a patent on this method of Big Data-informed pricing, which it calls “surge pricing”.

This algorithm-based approach with little human oversight has occasionally caused problems – it was reported that fares were pushed up sevenfold by traffic conditions in New York on New Year’s Eve 2011, with a journey of one mile rising in price from $27 to $135 over the course of the night.

This is an implementation of “dynamic pricing” – similar to that used by hotel chains and airlines to adjust price to meet demand – although rather than simply increasing prices at weekends or during public holidays, it uses predictive modelling to estimate demand in real time.  
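To make the idea concrete, here is a minimal, purely illustrative sketch of demand-based pricing in Python. The cap, rates and the simple demand/supply ratio are invented for illustration; this is not Uber’s actual patented algorithm.

# Purely illustrative sketch of demand-based surge pricing; rates and cap are invented.
def surge_multiplier(ride_requests, available_drivers, cap=5.0):
    """Raise the price multiplier when demand outstrips driver supply."""
    if available_drivers == 0:
        return cap
    ratio = ride_requests / available_drivers
    return min(cap, max(1.0, ratio))          # never below 1x, capped at 5x

def fare(base, per_minute, minutes, requests, drivers):
    # time-based pricing, scaled by current demand/supply conditions
    return (base + per_minute * minutes) * surge_multiplier(requests, drivers)

print(fare(base=2.5, per_minute=0.35, minutes=20, requests=300, drivers=100))  # 28.5 at a 3x surge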

UberPool

Changing the way we book taxis is just a part of the grand plan though. Uber CEO Travis Kalanick has claimed that the service will also cut the number of private, owner-operated automobiles on the roads of the world’s most congested cities. In an interview last year he said that he thinks the car-pooling UberPool service will cut the traffic on London’s streets by a third.

UberPool allows users to find others near them who, according to Uber’s data, often make similar journeys at similar times, and to offer to share a ride with them. According to the company’s blog, introducing this service became a no-brainer when their data told them the “vast majority of [Uber trips in New York] have a look-a-like trip – a trip that starts near, ends near, and is happening around the same time as another trip”.
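A hedged sketch of that “look-a-like trip” idea: two trips match if they start near each other, end near each other, and are requested around the same time. The thresholds and the flat-plane distance below are illustrative assumptions, not Uber’s matching logic.

# Illustrative matching of "look-a-like" trips; thresholds are made up.
from math import hypot

def is_lookalike(a, b, max_km=1.0, max_minutes=10):
    near_start = hypot(a["from"][0] - b["from"][0], a["from"][1] - b["from"][1]) <= max_km
    near_end   = hypot(a["to"][0] - b["to"][0], a["to"][1] - b["to"][1]) <= max_km
    near_time  = abs(a["minute"] - b["minute"]) <= max_minutes
    return near_start and near_end and near_time

trip1 = {"from": (0.0, 0.0), "to": (5.0, 5.0), "minute": 480}
trip2 = {"from": (0.3, 0.2), "to": (5.4, 4.8), "minute": 486}
print(is_lookalike(trip1, trip2))  # True: a candidate pair for a shared ride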

Other initiatives either trialled or due to launch in the future include UberChopper, offering helicopter rides to the wealthy, UberFresh for grocery deliveries and Uber Rush, a package courier service.

Rating systems

The service also relies on a detailed rating system – users can rate drivers, and vice versa – to build up trust and allow both parties to make informed decisions about who they want to share a car with.

Drivers in particular have to be very conscious of keeping their standards high – a leaked internal document showed that those whose score falls below a certain threshold face being “fired” and not offered any more work.

They have another metric to worry about, too – their “acceptance rate”. This is the number of jobs they accept versus those they decline. Drivers were told they should aim to keep this above 80%, in order to provide a consistently available service to passengers.

Uber’s response to protests over its service by traditional taxi drivers has been to attempt to co-opt them, by adding a new category to its fleet. UberTaxi - meaning you will be picked up by a licensed taxi driver in a registered private hire vehicle - joined UberX (ordinary cars for ordinary journeys), UberSUV (large cars for up to 6 passengers) and UberLux (high end vehicles) as standard options.

Regulatory pressure and controversies

It will still have to overcome legal hurdles – the service is currently banned in a handful of jurisdictions including Brussels and parts of India, and is receiving intense scrutiny in many other parts of the world. Several court cases are underway in the US regarding the company’s compliance with regulatory procedures.  

Another criticism is that because credit cards are the only payment option, the service is not accessible to a large proportion of the population in less developed nations where the company has focused its growth.

But given its popularity wherever it has launched around the world, there is a huge financial incentive for the company to press ahead with its plans for revolutionising private travel.

 If regulatory pressures do not kill it, then it could revolutionise the way we travel around our crowded cities – there are certainly environmental as well as economic reasons why this would be a good thing.

Uber is not alone – it has competitors offering similar services on a (so far) smaller scale, such as Lyft, Sidecar and Haxi. If a deregulated private hire market emerges through Uber’s innovation, it will be hugely valuable, and competition among these upstarts will be fierce. The most successful is likely to be the one which best uses the data available to it to improve the service it provides to customers.

Case study - how Uber uses big data - a nice, in-depth case study of how they have based their entire business model on big data, with some practical examples and some mention of the technology used.

I hope you found this post interesting. I am always keen to hear your views on the topic and invite you to comment with any thoughts you might have.

About : Bernard Marr is a globally recognized expert in analytics and big data. He helps companies manage, measure, analyze and improve performance using data.

His new book is: Big Data: Using Smart Big Data, Analytics and Metrics To Make Better Decisions and Improve Performance You can read a free sample chapter here

DSC Resources

Additional Reading

Follow us on Twitter: @DataScienceCtrl | @AnalyticBridge

Read more…

The 3Vs that define Big Data

Guest blog post by Diya Soubra

As I studied the subject, the following three terms stood out in relation to Big Data.

Variety, Velocity and Volume.

In marketing, the 4Ps define all of marketing using only four terms.
Product, Promotion, Place, and Price.


I claim that the 3Vs above define big data in a similar fashion.
These three properties describe the expansion of a data set along various fronts to the point where it merits being called big data – an expansion that is accelerating and generating ever more data of various types.

The plot above, using three axes, helps to visualize the concept.

Data Volume:
The size of available data has been growing at an increasing rate. This applies to companies and to individuals alike. A text file is a few kilobytes, a sound file is a few megabytes, while a full-length movie is a few gigabytes.
More sources of data are added on a continuous basis. For companies, in the old days, all data was generated internally by employees. Currently, the data is generated by employees, partners and customers. For a group of companies, the data is also generated by machines. For example, hundreds of millions of smartphones send a variety of information to the network infrastructure. This data did not exist five years ago.
More sources of data combined with larger data sizes increase the volume of data that has to be analyzed. This is a major issue for those looking to put that data to use instead of letting it just disappear.
Petabyte data sets are common these days, and exabyte-scale sets are not far away.

Large Synoptic Survey Telescope (LSST).
http://lsst.org/lsst/google
“Over 30 thousand gigabytes (30TB) of images will be generated every night during the decade-long LSST sky survey.”

https://www.youtube.com/t/press_statistics/?hl=en
72 hours of video are uploaded to YouTube every minute

There is a corollary to Parkinson’s law that states: “Data expands to fill the space available for storage.”
http://en.wikipedia.org/wiki/Parkinson’s_law

This is no longer true since the data being generated will soon exceed all available storage space.
http://www.economist.com/node/15557443

Data Velocity:
Initially, companies analyzed data using a batch process. One takes a chunk of data, submits a job to the server and waits for delivery of the result. That scheme works when the incoming data rate is slower than the batch processing rate and when the result is useful despite the delay. With the new sources of data such as social and mobile applications, the batch process breaks down. The data is now streaming into the server in real time, in a continuous fashion and the result is only useful if the delay is very short.
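The contrast can be shown with a toy example: a batch job waits for the whole chunk before producing anything, while a streaming computation keeps its answer current as each event arrives. This is only an illustration of the idea; real systems would use tools like Hadoop or stream-processing engines.

# Toy contrast between batch and streaming processing of an event feed.
def batch_average(events):
    """Wait for the whole chunk, then compute the result once."""
    return sum(events) / len(events)

class StreamingAverage:
    """Update the result as each event arrives, so it is always current."""
    def __init__(self):
        self.count = 0
        self.total = 0.0
    def add(self, value):
        self.count += 1
        self.total += value
        return self.total / self.count   # available immediately, no batch delay

stream = StreamingAverage()
for v in [3, 5, 7]:
    print(stream.add(v))             # 3.0, 4.0, 5.0 -- a running answer
print(batch_average([3, 5, 7]))      # 5.0, but only after all data has arrived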

http://blog.twitter.com/2011/03/numbers.html
140 million tweets per day on average (more in 2012).

I have not yet determined how data velocity may continue to increase, since real time is as fast as it gets. However, the delay between data arrival and the delivery of results and analysis will continue to shrink, also approaching real time.

Data Variety:
From Excel tables and databases, data has changed to lose its structure and to add hundreds of formats: pure text, photo, audio, video, web, GPS data, sensor data, relational databases, documents, SMS, PDF, Flash, and so on. One no longer has control over the input data format. Structure can no longer be imposed as in the past in order to keep control over the analysis, and as new applications are introduced, new data formats come to life.

Google uses smart phones as sensors to determine traffic conditions.

http://www.wired.com/autopia/2011/03/cell-phone-networks-and-the-future-of-traffic/
In this application they are most likely reading the speed and position of millions of cars to construct the traffic pattern in order to select the best routes for those asking for driving directions. This sort of data did not exist on a collective scale a few years ago.

The 3Vs together describe a set of data and a set of analysis conditions that clearly define the concept of big data.

 

So what is one to do about this?

So far, I have seen two approaches:
1. Divide and conquer using Hadoop.
2. Brute force using an “appliance” such as SAP HANA (High-Performance Analytic Appliance).

In the divide-and-conquer approach, the huge data set is broken down into smaller parts (HDFS) and processed (MapReduce) in a parallel fashion using thousands of servers.
http://www.kloudpedia.com/2012/01/10/hadoop/

As the volume of data increases, more servers are added and the process runs in the same manner. Need a shorter delay for the result? Add more servers again. Given that, with the cloud, server power is effectively unlimited, it is really just a matter of cost: how much is it worth to get the result in a shorter time?
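A tiny, single-machine sketch of the MapReduce idea helps make the divide-and-conquer point: each mapper works on its own chunk independently, and a reducer combines the intermediate results. Hadoop would distribute these same steps across many servers; the example below is illustrative only.

# Toy MapReduce-style word count in plain Python.
from collections import defaultdict
from itertools import chain

def map_phase(chunk):
    # each "mapper" processes its own chunk of the data independently
    return [(word, 1) for word in chunk.split()]

def reduce_phase(pairs):
    # the "reducer" combines the intermediate (word, count) pairs
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

chunks = ["big data is big", "data about data"]   # pretend each chunk lives on a different node
mapped = chain.from_iterable(map_phase(c) for c in chunks)
print(reduce_phase(mapped))  # {'big': 2, 'data': 3, 'is': 1, 'about': 1}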

One has to accept that not ALL data analysis can be done with Hadoop. Other tools are always required.

For the brute-force approach, a very powerful server with terabytes of memory is used to crunch the data as one unit. The data set is compressed in memory. For example, for a Twitter data flow that is pure text, the compression ratio may reach 100:1. A 1TB IBM SAP HANA appliance can then load a 100TB data set in memory and run analytics on it.
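The arithmetic behind that claim is straightforward; the lines below simply restate it under the stated 100:1 compression assumption.

# Restating the in-memory claim under an assumed 100:1 compression ratio.
memory_tb = 1            # physical RAM in the appliance
compression_ratio = 100  # claimed ratio for pure-text data such as tweets
print(memory_tb * compression_ratio, "TB of raw text can be held in", memory_tb, "TB of memory")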

IBM has a 100TB unit for demonstration purposes.
http://www.ibm.com/solutions/sap/us/en/landing/hana.html

Many other companies are filling in the gap between these two approaches by releasing all sorts of applications that address different steps of the data processing sequence plus the management and the system configuration.

Read more…

The Data Science Industry: Who Does What

Guest blog post by Laetitia Van Cauwenberge

Interesting infographics produced by DataCamp.com, an organisation offering R and data science training. Click here to see the original version. I would add that one of the core competencies of the data scientist is to automate the process of data analysis, as well as to create applications that run automatically in the background, sometimes in real-time, e.g.

  • to find and bid on millions of Google keywords each day (eBay and Amazon do that, and most of these keywords have little or no historical performance data, so keyword aggregation algorithms - putting keywords in buckets - must be used to find the right bid based on expected conversion rates),
  • buy or sell stocks,
  • monitor networks and generate automated alerts sent to the right people (to warn about a potential fraud, etc.)
  • or to recommend products to a user, identify optimum pricing, manage inventory, or identify fake reviews (a problem that Amazon and Yelp have failed to solve to this day)  

DSC Resources

Additional Reading

Follow us on Twitter: @DataScienceCtrl | @AnalyticBridge

Read more…

Google is a prolific contributor to Open source. Here is a list of 4 open source & cloud projects from Google focusing on analytics, machine learning, data cleansing & visualization.

TensorFlow

TensorFlow™ is an open source software library for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them. The flexible architecture allows you to deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API. TensorFlow was originally developed by researchers and engineers working on the Google Brain Team within Google's Machine Intelligence research organization for the purposes of conducting machine learning and deep neural networks research, but the system is general enough to be applicable in a wide variety of other domains as well.
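As a minimal illustration of the data-flow-graph idea, here is a small sketch using the TensorFlow 1.x-style API (placeholders and sessions); the shapes and values are arbitrary and the snippet is not taken from the TensorFlow documentation.

# Minimal sketch of a TensorFlow data flow graph, assuming the 1.x-style API.
import tensorflow as tf

# Nodes are operations; the edges between them carry tensors.
a = tf.placeholder(tf.float32, shape=[None, 3], name="a")       # input fed at run time
b = tf.constant([[1.0], [2.0], [3.0]], name="b")                # constant 3x1 tensor
product = tf.matmul(a, b, name="product")                       # matrix-multiplication node

with tf.Session() as sess:
    print(sess.run(product, feed_dict={a: [[1.0, 2.0, 3.0]]}))  # [[14.]]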

OpenRefine

OpenRefine (formerly Google Refine) is a powerful tool for working with messy data: cleaning it, transforming it from one format into another, and extending it with web services and external data. Please note that since October 2nd, 2012, Google is no longer actively supporting this project, which has now been rebranded to OpenRefine. Project development, documentation and promotion are now fully supported by volunteers.

Google Charts

Google Charts provides a perfect way to visualize data on your website. From simple line charts to complex hierarchical tree maps, the chart gallery provides a large number of ready-to-use chart types.

The most common way to use Google Charts is with simple JavaScript that you embed in your web page. You load some Google Chart libraries, list the data to be charted, select options to customize your chart, and finally create a chart object with an id that you choose. Then, later in the web page, you create a <div> with that id to display the Google Chart.

Automatic Statistician

Making sense of data is one of the great challenges of the information age we live in. While it is becoming easier to collect and store all kinds of data, from personal medical data, to scientific data, to public data, and commercial data, there are relatively few people trained in the statistical and machine learning methods required to test hypotheses, make predictions, and otherwise create interpretable knowledge from this data. The Automatic Statistician project aims to build an artificial intelligence for data science, helping people make sense of their data.


This article is compiled by Jogmon.

Originally posted on Data Science Central

Read more…

Guest blog and great infographic from our friends at DataCamp.com

Nowadays, the data science field is hot, and it is unlikely that this will change in the near future. While a data-driven approach is finding its way into all facets of business, companies are fiercely fighting for the best analytics skills available in the market, and salaries for data science roles are going into overdrive.

 Companies’ increased focus on acquiring data science talent goes hand in hand with the creation of a whole new set of data science roles and titles. Sometimes new roles and titles are added to reflect changing needs; other times they are probably created as a creative way to differentiate from fellow recruiters. Either way, it’s hard to get a general understanding of the different job roles, and it gets even harder when you’re looking to figure out what job role could best fit your personal and professional ambitions.

The Data Science Industry: Who Does What

DataCamp took a look at this avalanche of data science job postings in an attempt to unravel these cool-sounding and playful job titles into a comparison of different data science related careers. We summarized the results in our latest infographic “The Data Science Industry: Who Does What”:

In this infographic we compare the roles of data scientists, data analysts, data architects, data engineers, statisticians and many more. We have a look at their roles within companies and the data science process, what technologies they have mastered, and what the typical skillset and mindset is for each role. Furthermore, we look at the top employers that are currently hiring these different data science roles and how the average national salaries of these roles map out against each other.

Hopefully this infographic will help you to better understand the different job roles that are available to data passionate professionals.

The original blog and infographic can be seen here.

Originally posted on Data Science Central

Read more…

Guest blog shared by Stefan Kingham at Medigo.com

This infographic displays data from The Economist Intelligence Unit’s ‘Healthcare Outcomes Index 2014’, which took into account a number of diverse and complex factors to produce a ranking of the world’s best-performing countries in healthcare (outcome).

In order to produce a rounded set of outcome rankings, the EIU used basic factors like life expectancy and infant mortality rates alongside weighted factors such as Disability-Adjusted Life Years (DALYs) and Health-Adjusted Life Expectancy (HALEs), whilst also taking ageing populations and adult mortality rates into consideration.

The EIU also produced an overview of how much countries spend each year on healthcare per capita. This spending ranking is based on data from the World Health Organization (WHO).

By plotting the EIU’s outcome rankings against spending rankings for each country, we are able to develop a global overview of how effectively countries use their healthcare budgets.

See the original post here.

Originally posted on Data Science Central

Read more…

DataViz for Cavemen

The late seventies are considered prehistoric times by most data scientists. Yet it was the beginning of a new era, with people getting their first personal computers, or at least programmable calculators like the one pictured below. The operating system was called DOS, and later became MS-DOS, for Microsoft Disk Operating System. You could use your TV set as a monitor, and tapes and a tape recorder (then later floppy disks) to record data. Memory was limited to 64KB. Regarding the HP 41 model below, the advertising claimed that with some extra modules you could write up to 2,000 lines of code and save them permanently. I indeed started my career back in high school with the very model featured below, offered as a birthday present (and even inverted matrices with it). Math teachers were afraid of these machines; I believe they were banned from schools at some point.

One of the interesting features of these early times was that there was no real graphics device, not for personal use anyway (publishers, of course, had access to expensive plotting machines back then). So the trick was to produce graphs and images using only ASCII characters. Typical monitors could display 25 lines of 40 characters each, in a fixed-width font (Courier). More advanced systems would let you switch between two virtual screens, extending the length of a line to 80 characters.

Here are some of the marvels that you could produce back then – now this is considered an art. Has anyone ever made a video using just ASCII characters, like in this picture? If anything, it shows how shallow big data can be: a 1024 x 1024 image (or a video made up of hundreds of such frames) can be compressed by a factor of 2,000 or more and yet still convey pretty much all the useful information available in the big, original version. This raises another question: could this technique be used for face recognition?
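For the curious, here is a minimal sketch of how such an ASCII rendering can be produced today, assuming the Pillow imaging library is installed; the character ramp and sizing are arbitrary choices, not the method used for the image below.

# Minimal image-to-ASCII sketch, assuming Pillow (PIL) is available.
from PIL import Image

RAMP = "@%#*+=-:. "   # darkest to lightest characters

def image_to_ascii(path, width=80):
    img = Image.open(path).convert("L")          # grayscale
    aspect = img.height / img.width
    height = max(1, int(width * aspect * 0.5))   # 0.5 corrects for tall character cells
    img = img.resize((width, height))
    lines = []
    for y in range(height):
        row = "".join(RAMP[img.getpixel((x, y)) * (len(RAMP) - 1) // 255] for x in range(width))
        lines.append(row)
    return "\n".join(lines)

# print(image_to_ascii("portrait.png"))  # hypothetical input file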

This is supposed to be Obama - see details

Click here for details or to download this text file (the image)!

Originally posted on Data Science Central

DSC Resources

Additional Reading

Follow us on Twitter: @DataScienceCtrl | @AnalyticBridge

Read more…

Guest blog by Fabio Souto

A curated list of awesome data visualization frameworks, libraries and software. Inspired by awesome-python.

Table of contents

JavaScript tools

Charting libraries

  • C3 - a D3-based reusable chart library.
  • Chart.js - Charts with the canvas tag.
  • Charted - A charting tool that produces automatic, shareable charts from any data file.
  • Chartist.js - Responsive charts with great browser compatibility.
  • Dimple - An object-oriented API for business analytics.
  • Dygraphs - Interactive line charts library that works with huge datasets.
  • Echarts - Highly customizable and interactive charts ready for big datasets.
  • Epoch - Perfect to create real-time charts.
  • Highcharts - A charting library based on SVG and VML rendering. Free (CC BY-NC) for non-profit projects.
  • MetricsGraphics.js - Optimized for time-series data.
  • Morris.js - Pretty time-series line graphs.
  • NVD3 - A reusable charting library written in d3.js.
  • Peity - A library to create small inline svg charts.
  • TechanJS - Stock and financial charts.

Charting libraries for graphs

  • Cola.js - A tool to create diagrams using constraint-based optimization techniques. Works with d3 and svg.js.
  • Cytoscape.js - JavaScript library for graph drawing maintained by Cytoscape core developers.
  • Linkurious - A toolkit to speed up the development of graph visualization and interaction applications. Based on Sigma.js.
  • Sigma.js - JavaScript library dedicated to graph drawing.
  • VivaGraph - Graph drawing library for JavaScript.

Maps

  • CartoDB - CartoDB is an open source tool that allows for the storage and visualization of geospatial data on the web.
  • Cesium - WebGL virtual globe and map engine.
  • Leaflet - JavaScript library for mobile-friendly interactive maps.
  • Leaflet Data Visualization Framework - A framework designed to simplify data visualization and thematic mapping using Leaflet.
  • Mapsense.js - Combines d3.js with tile maps.
  • Modest Maps - BSD-licensed display and interaction library for tile-based maps in Javascript.

d3

dc.js

dc.js is a multi-dimensional charting library built to work natively with crossfilter.

Misc

  • Chroma.js - A small library for color manipulation.
  • Piecon - Pie charts in your favicon.
  • Recline.js - Simple but powerful library for building data applications in pure JavaScript and HTML.
  • Textures.js - A library to create SVG patterns.
  • Timeline.js - Create interactive timelines.
  • Vega - Vega is a visualization grammar, a declarative format for creating, saving, and sharing interactive visualization designs.
  • Vis.js - A dynamic visualization library including timeline, networks and graphs (2D and 3D).

Android tools

  • HelloCharts - Charting library for Android compatible with API 8+.
  • MPAndroidChart - A powerful & easy to use chart library.

C++ tools

Golang tools

  • Charts for Go - Basic charts in Go. Can render to ASCII, SVG and images.
  • svgo - Go Language Library for SVG generation.

iOS tools

  • JBChartView - Charting library for both line and bar graphs.
  • PNChart - A simple and beautiful chart lib used in Piner and CoinsMan.
  • ios-charts - iOS port of MPAndroidChart. You can create charts for both platforms with very similar code.

Python tools

  • bokeh - Interactive Web Plotting for Python.
  • matplotlib - A python 2D plotting library.
  • pygal - A dynamic SVG charting library.
  • seaborn - A library for making attractive and informative statistical graphics.
  • toyplot - The kid-sized plotting toolkit for Python with grownup-sized goals.

R tools

  • ggplot2 - A plotting system based on the grammar of graphics.
  • rbokeh - R Interface to Bokeh.
  • rgl - 3D Visualization Using OpenGL

Ruby tools

  • Chartkick - Create beautiful JavaScript charts with one line of Ruby.

Other tools

Tools that are not tied to a particular platform or language.

  • Lightning - A data-visualization server providing API-based access to reproducible, web-based, interactive visualizations.
  • RAW - Create web visualizations from CSV or Excel files.
  • Spark - Sparklines for the shell. It has several implementations in different languages.
  • Periscope - Create charts directly from SQL queries.

Resources

Books

Twitter accounts

Websites

Contributing

  • Please check for duplicates first.
  • Keep descriptions short, simple and unbiased.
  • Please make an individual commit for each suggestion
  • Add a new category if needed.

Thanks for your suggestions!

License

CC0

To the extent possible under law, Fabio Souto has waived all copyright and related or neighboring rights to this work.

Originally posted on Data Science Central

Read more…

Guest blog post by Laetitia Van Cauwenberge

Your best references to do your job or get started in data science.

  1. Machine Learning on GitHub
  2. Supervised Learning on GitHub
  3. Cheat Sheet: Data Visualization with R
  4. Cheat Sheet: Data Visualisation in Python
  5. scikit-learn Algorithm Cheat Sheet
  6. Vincent Granville's Data Science Cheat Sheet - Basic
  7. Vincent Granville's Data Science Cheat Sheet - Advanced
  8. Cheat Sheet – 10 Machine Learning Algorithms & R Commands
  9. Microsoft Azure Machine Learning : Algorithm Cheat Sheet
  10. Cheat Sheet – Algorithm for Supervised and Unsupervised Learning
  11. Machine Learning and Predictive Analytics, on Dzone
  12. ML Algorithm Cheat Sheet by Laurence Diane
  13. CheatSheet: Data Exploration using Pandas in Python
  14. 24 Data Science, R, Python, Excel, and Machine Learning Cheat Sheets .

Click here for picture source

DSC Resources

Additional Reading

Follow us on Twitter: @DataScienceCtrl | @AnalyticBridge

Read more…

If you are a data scientist or analyst looking to implement Tableau or join a Tableau implementation company, here are 8 companies you can evaluate.

Neos.hr

Tableau helps anyone quickly analyze, visualize and share information. Rapid-fire business intelligence gives you the ability to answer your own questions in minutes.

You can work with all kinds of data, from Hadoop to data warehouses to spreadsheets, and across disparate data sets.

  • Your entire organization is served, from executives to analysts, across departments and geographic locations, in the office or on-the-go.

  • It fits in seamlessly as an extension of your IT infrastructure.

  • You can take advantage of the new generation of user-friendly, visual interfaces to spot outliers and trends in complex data.

  • You, and everyone around you, are self-reliant. When it comes to getting answers from data, you don’t have to wait for anyone or anything.

  • And it’s easy on your budget by providing low cost of ownership and a return on your investment in days or weeks, not months or years.

Interworks.com

Tableau puts analytics in the hands of the user. By enabling individual creativity and exploration from the ground floor, businesses now have the ability to adapt and outperform the competition through intuitive data visualization and analysis.

Tableau can connect to virtually any data source, be it a corporate data warehouse, Microsoft Excel or web-based data. It gives users immediate insights by transforming their data into beautiful, interactive visualizations in a matter of seconds. What used to take expensive development teams days or months is now achieved through a user-friendly drag-and-drop interface.

Boulder Insights

  • Uncover unseen patterns in your data

  • Navigate sophisticated drag and drop dashboards with ease

  • Analyze millions of rows of data ad hoc and in moments

  • Inform smart business decisions

  • Combine multiple data sources for quick analysis

  • View real-time refreshed dashboards

  • Share beautiful and clear dashboards with colleagues

Syntelli Solutions

Tableau’s rapid-fire business intelligence software lets everyone in your organization analyze and understand their data far faster than any other solution and at a fraction of their costs and resources.

  • Ease of Use:

Tableau Desktop lets you create rich visualizations and dashboards with an intuitive, drag-and-drop interface that lets you see every change as you make it. Anyone comfortable with Excel can get up to speed on Tableau quickly.

  • Direct Connect and Go

There is minimal set-up required with Tableau. In minutes you’ll be consolidating numbers and visualizing results without any advance set-up, helping to free up your IT resources and enabling you to arrive at results quickly.

  • Perfect Mashups

Connect to data in one click and layer in a second data source with another. Combining data sources in the same view is so easy it feels like cheating.

  • Best Practices in a Box

Tableau has best practices built right in. You get the benefit of years of research on the best way to represent data, from color schemes that convey meaning to an elegant design that keeps users focused on what’s important.


marquis leadership

Tableau lets you use more than one data series to spotlight high and low performers in a single view. The bar chart below shows Sales by Customers, sorted in order of highest sales. I dropped the Profit data onto the Color mark, and, voila! I see not just customers who have the most sales, but also those with highest profits – just by looking at the gradation of colors from red to green. Suddenly it’s easy to see that some of the customers with lower sales are more profitable than customers with higher sales.

TEG Analytics

Tableau is the leading data visualization tool in the Gartner Magic Quadrant. And nobody knows Tableau better than we do…

  • Business performance evaluation and performance-driver analysis dashboards for a Fortune 500 US CPG company

  • Pricing elasticity analytics dashboard and simulator for one of the world’s leading shipping and logistics service companies

  • Inventory management and built-in alert system dashboards for the world’s leading retail company

  • Bug management and planning dashboards for a large IT product development company

  • Campaign management and promotions analysis reporting for a large digital and direct mail agency

DataSelf

Tableau’s most obvious features include its easy, intuitive user interface. Open it and go, anytime and anywhere. Use it on your desktop, on the Web, or on your mobile device. Tableau showed the world what self-service BI and data discovery should be, and it keeps leading the way.

That leadership has been acknowledged year after year by the industry’s touchstone, the Gartner Magic Quadrant, which has rated Tableau the most powerful, most intuitive analysis tool on the market for the last four years in a row.

But you don’t see the best benefit until you use it: Users stay in the flow from question to answer to more questions and more answers, all the way to insight. They don’t stop to fiddle with data, they just stay with the analysis.

datatonic

Tableau Software helps people see and understand data. Used by more than 19,000 companies and organizations worldwide, Tableau’s award-winning software delivers fast analytics and rapid-fire business intelligence. Create visualizations and dashboards in minutes, then share in seconds. The result? You get answers from data quickly, with no programming required. Data is everywhere, but most of us struggle to make sense of it. Tableau Software lets anyone visualize data and then share it on the web, no programming needed. It’s wicked-fast, easy analytics.


Originally posted on Data Science Central

Read more…

An Introduction to Data Visualization

Guest blog post by Divya Parmar

After data science, which I discussed in an earlier post, data visualization is one of the most common buzzwords thrown around in the tech and business communities. To demonstrate how one can actually visualize data, I want to use one of the hottest tools in the market right now: Tableau. You can download Tableau Public for free here, and the “Cat vs. Dog” dataset can be found here. Let’s get started.

1. Play around with the data and find what looks interesting.

After opening Tableau Public and importing my Excel file, I looked over my dataset. I was curious to see if there was relationship between the rate of cat ownership and dog ownership. So I put dog ownership on the x-axis and cat ownership on the y-axis; I then added state name as a label. All of this is done through simply dragging and dropping, and below is a snapshot of how intuitive it is.

2. Add some elements as necessary to show your insight.

There are many ways to build on the preliminary step. You can add something like a trend line to demonstrate a statistical relationship (note that there is a p-value with the trend line), which is done through the "Analysis" tab and adds more credibility. You can even give different colors or sizes to different data points, as I have done below, using the number of pet households by state to emphasize the larger states.

3. Fix and improve to make usable for export, presentation, or other purpose.

Data visualization is only useful if it is simple and to the point. In the above example, the District of Columbia data point is an outlier that is making the rest of the graph harder to read. You can edit your axis to not show D.C., and can also remove the confidence bands for the trend line to remove unessential information. 

After your visualization is ready, put it to use by sharing, embedding, or whatever means works for you.  Data visualization is easier than you think, and I encourage you to get started.

About: Divya Parmar is a recent college graduate working in IT consulting. For more posts every week, and to subscribe to his blog, please click here.

Read more…

In order to use Big Data effectively, a holistic approach is required. Organizations are now using data analytics at every level, and roles that previously would have had no need to concern themselves with data are now required to have some degree of understanding in order to leverage insights.

Ensuring that data is presented in such a way as to be understood and utilized by all employees is, however, a challenge. Most Big Data actually yields neither meaning nor value, and the sheer volume coming into businesses can be overwhelming. Companies are therefore increasingly moving away from simple 2D Excel charts, and replacing or supplementing them with powerful data visualization tools.

Sophisticated data visualization is a tool that supports analytic reasoning. It accommodates the large numbers of data points provided by Big Data using additional dimensions, colors, animations, and the digital equivalents of familiar items such as dials and gauges. The user can look at the graphics provided to reveal entanglements that may otherwise be missed, and patterns in the data can be displayed to the user at great speeds. Nicholas Marko, Chief Data Officer at Geisinger Health System, notes that: ‘You need to represent a lot of information data points in a small space without losing the granularity of what you're trying to reflect. That's why you're seeing different kinds of graphics, infographics, and more art-like or cartoon-like visualizations. It's certainly better than increasing the density of dots on a graph. At a certain point, you're not getting anything more meaningful.’

There are numerous benefits to data visualization. It reduces dependence on IT, allowing that department to focus on adding value and optimizing processes. Increasing the speed at which data is analyzed also means that it can be acted upon more quickly. Intel’s IT Manager Survey found that IT managers expect 63% of all analytics to be done in real time, and those who cannot understand and act on data in this way face losing their competitive advantage to others.

Data visualization for mobile, in particular, is becoming increasingly important. The requirement for real-time analytics and the ubiquity of mobile devices mean that many data visualization vendors are now either adapting desktop experiences to mobile formats or taking a mobile-first approach to developing their technology. There are obvious constraints, with smartphone display space limited. Information needs to be presented more simply than on a desktop, so rather than just translating a complex desktop view into a simpler mobile one, it is important to consider context. Designers are also exploiting gesture-based input to help users easily navigate different views and interact with the data.

Gartner estimated a 30% compound annual growth rate in the use of data analytics through 2015. Visualization data discovery tools offer a tremendous opportunity to manage and leverage the growing volume, variety, and velocity of new and existing data. Using the faster, deeper insights afforded, companies are more agile, and have a significant competitive advantage moving forward. 

Originally posted on Data Science Central

Read more…


Title: How Flextronics uses DataViz and Analytics to Improve Customer Satisfaction
Date: Tuesday, December 15, 2015
Time: 09:00 AM Pacific Standard Time
Duration: 1 hour

Space is limited.
Reserve your Webinar seat now

Flexibility in adapting to changing global markets and customer needs is necessary to stay competitive, and the Flextronics analytics team is tasked with making sure the Flex management team has accurate and up-to-date analytics to optimize their business’s performance, efficiency, and customer service.

In our latest DSC webinar series, Joel Woods from Flextronics’ Global Services and Solutions will share success stories around analytics for the repairs and refurbishment of customer products utilizing analytics and data visualization from Tableau and Alteryx.

You will learn how to:

  • Use data analytics to improve cost savings
  • Resolve common data challenges such as blending disparate data sources
  • Deliver automated and on-demand reporting to clients
  • Provide visualizations that show the analytics that matter to both internal teams and customers

About Flextronics:

Flextronics is an industry leading end-to-end supply chain solutions company with $26 billion in sales, generated from helping customers design, build, ship, and service their products through an unparalleled network of facilities in approximately 30 countries and across four continents. 

Speakers: 

Ross Perez, Sr Product Manager -- Tableau
Joel Woods, Advanced Analytics Lead -- Flex Inc.
Maimoona Block, Alliance Manager -- Alteryx

Hosted by Bill Vorhies, Editorial Director -- Data Science Central


Again, Space is limited so please register early:
Reserve your Webinar seat now

After registering you will receive a confirmation email containing information about joining the Webinar.
Read more…

Smart Big Data: The All-Important 90/10 Rule

Guest blog post by Bernard Marr

The sheer volumes involved with Big Data can sometimes be staggering. So if you want to get value from the time and money you put into a data analysis project, a structured and strategic approach is very important.

The phenomenon of Big Data is giving us an ever-growing volume and variety of data which we can now store and analyze. Any regular reader of my posts knows that I personally prefer to focus on Smart Data rather than Big Data, because the latter term places too much importance on the size of the data. The real potential for revolutionary change comes from the ability to manipulate, analyze and interpret new data types in ever-more sophisticated ways.

Application of the Pareto distribution and 90/10 rule in a related context

The SMART Data Framework

I’ve written previously about my SMART Data framework which outlines a step-by-step approach to delivering data-driven insights and improved business performance.

  1. Start with strategy: Formulate a plan – based on the needs of your business
  2. Measure metrics and data: Collect and store the information you need
  3. Apply analytics: Interrogate the data for insights and build models to test theories
  4. Report results: Present the findings of your analysis in a way that the people who will put them into effect will understand
  5. Transform your business: understand your customers better, optimize business processes, improve staff wellbeing, or increase revenues and profits.

My work involves helping businesses use data to drive business value. Because of this I get to see a lot of half-finished data projects, mothballed when it was decided that external help was needed.

The biggest mistake by far is putting insufficient thought – or neglecting to put any thought – into a structured strategic approach to big data projects. Instead of starting with strategy, too many companies start with the data. They start frantically measuring and recording everything they can in the belief that big data is all about size. Then they get lost in the colossal mishmash of everything they’ve collected, with little idea of how to go about mining the all-important insights.

This is why I have come up with the 90/10 rule: when working with data, 90% of your time should be spent on a structured, strategic approach, while 10% of your time should be spent “exploring” the data.

The 90/10 Rule

The 90% structured time should be used putting the steps outlined in the SMART Data framework into operation. Making a logical progression through an ordered set of steps with a defined beginning (a problem you need to solve), middle (a process) and an ending (answers or results).

This is, after all, why we call it Data Science. Business data projects are very much like scientific experiments, where we run simulations testing the validity of theories and hypotheses to produce quantifiable results.

The other 10% of your time can be spent freely playing with your data – mining for patterns and insights which, while they may be valuable in other ways, are not an integral part of your SMART Data strategy.

Yes, you can be really lucky and your data exploration can deliver valuable insights – and who knows what you might find, or what inspiration may come to you? But it should always play second fiddle to following the structure of your data project in a methodical and comprehensive way.

Always start with strategy

I think this is a very important point to make, because it’s something I often see companies get the wrong way round. Too often, the data is taken as the starting point, rather than the strategy.

Businesses that do this run the very real risk of becoming “data rich and insight poor”. They are in danger of missing out on the hugely exciting benefits that a properly implemented and structured data-driven initiative can bring.

Working in a structured way means “Starting with strategy”, which means identifying a clear business need and what data you will need to solve it. Businesses that do this, and follow it through in a methodical way will win the race to unearth the most valuable and game-changing insights.

I hope you found this post interesting. I am always keen to hear your views on the topic and invite you to comment with any thoughts you might have.

About : Bernard Marr is a globally recognized expert in analytics and big data. He helps companies manage, measure, analyze and improve performance using data.

His new book is: Big Data: Using Smart Big Data, Analytics and Metrics To Make Better Decisions and Improve Performance You can read a free sample chapter here

DSC Resources

Additional Reading

Follow us on Twitter: @DataScienceCtrl | @AnalyticBridge

Read more…

 

Guest blog by Greg Roberts at Packt Publishing

(to see this graph in its fully interactive form see http://gregroberts.github.io/)

I love Python, and to celebrate Packt’s Python Week, I’ve spent some time developing an app using some of my favourite tools. The app is a graph visualisation of Python and related topics, as well as showing where all our content fits in. The topics are all StackOverflow tags, related by their co-occurrence in questions on the site.

The app is available to view at http://gregroberts.github.io/ and in this blog, I’m going to discuss some of the techniques I used to construct the underlying dataset, and how I turned it into an online application using some of my favourite tools.

Graphs, not charts

Graphs are an incredibly powerful tool for analysing and visualising complex data. In recent years, many different graph database engines have been developed to make use of this novel manner of representing data. These databases offer many benefits over traditional, relational databases because of how the data is stored and accessed.

Here at Packt, I use a Neo4j graph to store and analyse data about our business. Using the Cypher query language, it’s easy to express complicated relations between different nodes succinctly.

It’s not just the technical aspect of graphs which make them appealing to work with. Seeing the connections between bits of data visualised explicitly as in a graph helps you to see the data in a different light, and make connections that you might not have spotted otherwise. This graph has many uses at Packt, from customer segmentation to product recommendations. In the next section, I describe the process I use to generate recommendations from the database.

Make the connection

For product recommendations, I use what’s known as a hybrid filter. This considers both content based filtering (product x and y are about the same topic) and collaborative filtering (people who bought x also bought y). Each of these methods has strengths and weaknesses, so combining them into one algorithm provides a more accurate signal.
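Before looking at the Cypher, here is a tiny hedged sketch of what "combining them into one algorithm" could look like once each filter has produced a score per title. The weights and scores below are invented for illustration and are not taken from the original pipeline.

# Hypothetical hybrid score: weighted blend of collaborative and content-based signals.
def hybrid_score(also_bought, also_about, w_collab=0.6, w_content=0.4):
    titles = set(also_bought) | set(also_about)
    return {t: w_collab * also_bought.get(t, 0.0) + w_content * also_about.get(t, 0.0)
            for t in titles}

also_bought = {"Learning Neo4j": 0.30, "Python Data Analysis": 0.10}   # collaborative scores
also_about  = {"Learning Neo4j": 0.20, "Graph Databases 101": 0.25}    # content-based scores
print(max(hybrid_score(also_bought, also_about).items(), key=lambda kv: kv[1]))
# ('Learning Neo4j', 0.26) -- strongest on the combined signal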

 

The collaborative aspect is straightforward to implement in Cypher. For a particular product, we want to find out which other product is most frequently bought alongside it. We have all our products and customers stored as nodes, and purchases are stored as edges. Thus, the Cypher query we want looks like this:

 

MATCH (n:Product {title:'Learning Cypher'})-[r:purchased*2]-(m:Product)
WITH m.title AS suggestion, count(distinct r)/(n.purchased+m.purchased) AS alsoBought
WHERE m<>n
RETURN * ORDER BY alsoBought DESC

 

and will very efficiently return the most commonly also purchased product. When calculating the weight, we divide by the total units sold of both titles, so we get a proportion returned. We do this so we don’t just get the titles with the most units; we’re effectively calculating the size of the intersection of the two titles’ audiences relative to their overall audience size.

 

The content side of the algorithm looks very similar:

 

MATCH (n:Product {title:'Learning Cypher'})-[r:is_about*2]-(m:Product)
WITH m.title AS suggestion, count(distinct r)/(length(n.topics)+length(m.topics)) AS alsoAbout
WHERE m<>n
RETURN * ORDER BY alsoAbout DESC

 

Implicit in this algorithm is knowledge that a title is_about a topic of some kind. This could be done manually, but where’s the fun in that?

In Packt’s domain there already exists a huge, well moderated corpus of technology concepts and their usage: StackOverflow. The tagging system on StackOverflow not only tells us about all the topics developers across the world are using, it also tells us how those topics are related, by looking at the co-occurrence of tags in questions. So in our  graph, StackOverflow tags are nodes in their own right, which represent topics. These nodes are connected via edges, which are weighted to reflect their co-occurrence on StackOverflow:

 

edge_weight(n,m) = (# of questions tagged with both n & m)/(# questions tagged with n or m)
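A tiny sketch of that weight (a Jaccard-style ratio), computed here from hypothetical sets of question ids per tag; nothing is assumed about how the real pipeline stores them.

# Co-occurrence weight between two tags, given sets of question ids (hypothetical data).
def edge_weight(questions_n, questions_m):
    both = len(questions_n & questions_m)      # questions tagged with both n and m
    either = len(questions_n | questions_m)    # questions tagged with n or m
    return both / either if either else 0.0

matplotlib_qs = {1, 2, 3, 4}
numpy_qs = {3, 4, 5, 6, 7}
print(edge_weight(matplotlib_qs, numpy_qs))  # 0.2857..., i.e. fairly strongly related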


So, to find topics related to a given topic, we could execute a query like this:


MATCH (n:StackOverflowTag {name:'Matplotlib'})-[r:related_to]-(m:StackOverflowTag)
RETURN n.name, r.weight, m.name ORDER BY r.weight DESC LIMIT 10

 

Which would return the following:

 

    | n.name     | r.weight | m.name
----+------------+----------+--------------------
  1 | Matplotlib | 0.065699 | Plot
  2 | Matplotlib | 0.045678 | Numpy
  3 | Matplotlib | 0.029667 | Pandas
  4 | Matplotlib | 0.023623 | Python
  5 | Matplotlib | 0.023051 | Scipy
  6 | Matplotlib | 0.017413 | Histogram
  7 | Matplotlib | 0.015618 | Ipython
  8 | Matplotlib | 0.013761 | Matplotlib Basemap
  9 | Matplotlib | 0.013207 | Python 2.7
 10 | Matplotlib | 0.012982 | Legend

 

There are many more complex relationships you can define between topics like this, too. You can infer directionality in the relationship by looking at the local network, or you could start constructing hypergraphs using the extensive StackExchange API.

 

So we have our topics, but we still need to connect our content to topics. To do this, I’ve used a two stage process.

Step 1 – Parsing out the topics

We take all the copy (words) pertaining to a particular product as a document representing that product. This includes the title, chapter headings, and all the copy on the website. We use this because it’s already been optimised for search, and should thus carry a fair representation of what the title is about. We then parse this document and keep all the words which match the topics we’ve previously imported.

 

import re

# ...code for fetching all the copy for all the products
key_re = r'\W(%s)\W' % '|'.join(re.escape(i) for i in topic_keywords)
for i in documents:
    tags = re.findall(key_re, i['copy'])
    i['tags'] = map(lambda x: tag_lookup[x], tags)

 

Having done this for each product, we have a bag of words representing each product, where each word is a recognised topic.

Step 2 – Finding the information

From each of these documents, we want to know the topics which are most important for that document. To do this, we use the tf-idf algorithm. Tf-idf stands for term frequency, inverse document frequency. The algorithm takes the number of times a term appears in a particular document and divides it by the proportion of documents that word appears in. The term frequency factor boosts terms which appear often in a document, whilst the inverse document frequency factor gets rid of terms which are overly common across the entire corpus (for example, the term ‘programming’ is common in our product copy, and whilst most of the documents ARE about programming, this doesn’t provide much discriminating information about each document).
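As a sanity check on that description, here is a tiny hand-rolled version of the weighting. It is a simplified variant for illustration only and does not reproduce the exact smoothing and normalisation scikit-learn applies.

# Simplified tf-idf on a made-up corpus of word lists.
import math

def tf_idf(term, doc, corpus):
    tf = doc.count(term) / len(doc)                          # how often the term appears in this document
    df = sum(1 for d in corpus if term in d) / len(corpus)   # fraction of documents containing the term
    return tf * math.log(1.0 / df) if df > 0 else 0.0        # rare-across-corpus terms score higher

corpus = [["python", "graph", "neo4j"],
          ["python", "pandas"],
          ["python", "programming", "graph"]]
print(tf_idf("graph", corpus[0], corpus))   # informative: appears here but not everywhere
print(tf_idf("python", corpus[0], corpus))  # 0.0: 'python' is in every document, so idf vanishes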

 

To do all of this, I use python (obviously) and the excellent scikit-learn library. Tf-idf is implemented in the class sklearn.feature_extraction.text.TfidfVectorizer. This class has lots of options you can fiddle with to get more informative results.

 

import sklearn.feature_extraction.text as skt

tagger = skt.TfidfVectorizer(input='content',
                             encoding='utf-8',
                             decode_error='replace',
                             strip_accents=None,
                             analyzer=lambda x: x,
                             ngram_range=(1, 1),
                             max_df=0.8,
                             min_df=0.0,
                             norm='l2',
                             sublinear_tf=False)

 

It’s a good idea to use the min_df and max_df arguments of the constructor to cut out the most common and most obscure words and get a more informative weighting. The ‘analyzer’ argument tells it how to get the words from each document; in our case, the documents are already lists of normalised words, so we don’t need anything additional done.

 

# create vectors of all the documents
vectors = tagger.fit_transform(map(lambda x: x['tags'], rows)).toarray()

# get back the topic names to map to the graph
t_map = tagger.get_feature_names()

jobs = []
for ind, vec in enumerate(vectors):
    features = filter(lambda x: x[1] > 0, zip(t_map, vec))
    doc = documents[ind]
    for topic, weight in features:
        job = '''MERGE (n:StackOverflowTag {name:'%s'})
MERGE (m:Product {id:'%s'})
CREATE UNIQUE (m)-[:is_about {source:'tf_idf', weight:%f}]-(n)
''' % (topic, doc['id'], weight)
        jobs.append(job)

 

We then execute all of the jobs using Py2neo’s Batch functionality.

 

Having done all of this, we can now relate products to each other in terms of what topics they have in common:

 

MATCH (n:Product {isbn10:'1783988363'})-[r:is_about]-(a)-[q:is_about]-(m:Product {isbn10:'1783289007'})
WITH a.name AS topic, r.weight+q.weight AS weight
RETURN topic
ORDER BY weight DESC LIMIT 6

 

 

Which returns:

 

   | topic
---+------------------
 1 | Machine Learning
 2 | Image
 3 | Models
 4 | Algorithm
 5 | Data
 6 | Python

 

Huzzah! I now have a graph into which I can throw any piece of content about programming or software, and it will fit nicely into the network of topics we’ve developed.

Take a breath

 

So, that’s how the graph came to be. To communicate with Neo4j from Python, I use the excellent py2neo module, developed by Nigel Small. This module has all sorts of handy abstractions to allow you to work with nodes and edges as native Python objects, and then update your Neo instance with any changes you’ve made.
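To give a flavour of those abstractions, here is a tiny, hedged example in the style of py2neo 2.x; the labels and properties are made up for illustration, not taken from the real graph.

from py2neo import Graph, Node, Relationship

graph = Graph()

# Build plain Python objects first...
python_tag = Node("StackOverflowTag", name="Python")
book = Node("Product", id="example-product-id")

# ...then push them (and the edge between them) to the Neo instance in one call
graph.create(Relationship(book, "is_about", python_tag, source="manual", weight=1.0))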

The graph I’ve spoken about is used for many purposes across the business, and has grown in size and scope significantly over the last year. For this project, I’ve taken from this graph everything relevant to Python.

I started by getting all of our content which is_about Python, or about a topic related to python:

 

titles = [i.n for i in graph.cypher.execute('''MATCH (n)-[r:is_about]-(m:StackOverflowTag {name:'Python'}) RETURN DISTINCT n''')]

t2 = [i.n for i in graph.cypher.execute('''MATCH (n)-[r:is_about]-(m:StackOverflowTag)-[:related_to]-(p:StackOverflowTag {name:'Python'}) WHERE has(n.name) RETURN DISTINCT n''')]

titles.extend(t2)

 

then hydrated this further by going one or two hops down each path in various directions, to get a large set of topics and content related to Python.
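As a rough sketch of what that hydration can look like (the relationship types and the 1..2 hop limit here are assumptions for illustration, following the pattern of the queries above):

related = graph.cypher.execute('''
    MATCH (p:StackOverflowTag {name:'Python'})-[:related_to|is_about*1..2]-(n)
    RETURN DISTINCT n
''')
titles.extend(i.n for i in related)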

 

Visualising the graph

 

Since I started working with graphs, two visualisation tools I’ve always used are Gephi and Sigma.js.

 

Gephi is a great solution for analysing and exploring graphical data, allowing you to apply a plethora of different layout options, find out more about the statistics of the network, and to filter and change how the graph is displayed.

 

Sigma.js is a lightweight JavaScript library which allows you to publish beautiful graph visualisations in a browser, and it copes very well with even very large graphs.

 

Gephi has a great plugin which allows you to export your graph straight into a web page which you can host, share and adapt.

 

More recently, Linkurious have made it their mission to bring graph visualisation to the masses. I highly advise trying the demo of their product. It really shows how much value it’s possible to get out of graph based data. Imagine if your Customer Relations team were able to do a single query to view the entire history of a case or customer, laid out as a beautiful graph, full of glyphs and annotations.

Linkurious have built their product on top of Sigma.js, and they’ve made available much of the work they’ve done as the open source Linkurious.js. This is essentially Sigma.js, with a few changes to the API, and an even greater variety of plugins. On Github, each plugin has an API page in the wiki and a downloadable demo. It’s worth cloning the repository just to see the things it’s capable of!

 

Publish It!

So here’s the workflow I used to get the Python topic graph out of Neo4j and onto the web.

 

-Use Py2neo to grab the subgraph of content and topics pertinent to Python, as described above

 

-Add to this some other topics linked to the same books to give a fuller picture of the Python “world”

 

-Add in topic-topic edges and product-product edges to show the full breadth of connections observed in the data

 

-Export all the nodes and edges to CSV files (see the sketch after this list)

 

-Import the node and edge tables into Gephi.

 

The reason I’m using Gephi as a middle step is so that I can fiddle with the visualisation in Gephi until it looks perfect. The layout plugin in Sigma is good, but this way the graph is presentable as soon as the page loads, the communities are much clearer, and I’m not putting undue strain on browsers across the world!

 

-The layout of the graph has been achieved using a number of plugins. Instead of using the pre-installed ForceAtlas layouts, I’ve used the OpenOrd layout, which I feel really shows off the communities of a large graph. There’s a really interesting and technical presentation about how this layout works here.

 

-Export the graph into gexf format, having applied some partition and ranking functions to make it more clear and appealing.
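For the CSV export step, here is a hedged sketch using the same py2neo cypher API as earlier. The queries, file names and column choices are illustrative; the key point is that Gephi expects an Id column in the node table and Source/Target columns in the edge table.

import csv

node_rows = graph.cypher.execute(
    "MATCH (n) RETURN id(n) AS nid, head(labels(n)) AS nlabel, coalesce(n.name, n.id) AS nname")
edge_rows = graph.cypher.execute(
    "MATCH (a)-[r]->(b) RETURN id(a) AS src, id(b) AS dst, type(r) AS rtype, coalesce(r.weight, 1.0) AS weight")

with open('nodes.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(['Id', 'Label', 'Name'])                 # Gephi's node table wants an Id column
    for row in node_rows:
        writer.writerow([row.nid, row.nlabel, row.nname])

with open('edges.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(['Source', 'Target', 'Type', 'Weight'])  # Gephi's edge table wants Source/Target
    for row in edge_rows:
        writer.writerow([row.src, row.dst, row.rtype, row.weight])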

 

Now it’s all down to Linkurious and its various plugins! You can explore the source code of the final page to see all the details, but here I’ll give an overview of the different plugins I’ve used for the different parts of the visualisation:

First, instantiate the graph object, pointing to a container (note the CSS of the container; without this, the graph won’t display properly):

 

<style type="text/css">
  #container {
    max-width: 1500px;
    height: 850px;
    margin: auto;
    background-color: #E5E5E5;
  }
</style>

<div id="container"></div>

<script>
s = new sigma({
  container: 'container',
  renderer: {
    container: document.getElementById('container'),
    type: 'canvas'
  },
  settings: {
    …
  }
});

 

-sigma.parsers.gexf - used for (trivially!) importing a gexf file into a sigma instance

 

sigma.parsers.gexf(

            'static/data/Graph1.gexf',

            s,

            function(s) {

                        //callback executed once the data is loaded, use this to set up any aspects of the app which depend on the data

            });

 

-sigma.plugins.filter - Adds the ability to very simply hide nodes/edges based on a callback function. This powers the filtering widget on the page.

 

<input class="form-control" id="min-degree" type="range" min="0" max="0" value="0">

function applyMinDegreeFilter(e) {
  // "filter" is the sigma.plugins.filter instance created when the graph is set up
  var v = e.target.value;
  $('#min-degree-val').text(v);   // update the label showing the current minimum degree
  filter
    .undo('min-degree')
    .nodesBy(
      function(n, options) {
        // keep only nodes whose degree is at least the slider value
        return this.graph.degree(n.id) >= options.minDegreeVal;
      }, {
        minDegreeVal: +v
      },
      'min-degree'
    )
    .apply();
};

$('#min-degree').change(applyMinDegreeFilter);

 

-sigma.plugins.locate - Adds the ability to zoom in on a single node or collection of nodes. Very useful if you’re filtering a very large initial graph.

 

function locateNode (nid) {

            if (nid == '') {

                        locate.center(1);

            }

            else {

                        locate.nodes(nid);

            }

};

 

-sigma.renderers.glyphs - Allows you to add custom glyphs to each node. Useful if you have many types of node.

 

Outro

This application has been a very fun little project to build. The improvements to Sigma wrought by Linkurious have resulted in an incredibly powerful toolkit to rapidly generate graph based applications with a great degree of flexibility and interaction potential.

 

None of this would have been possible were it not for Python. Python is my right (left, I’m left handed) hand, which I use for almost everything. Its versatility and expressiveness make it an incredibly robust Swiss army knife in any data analyst’s toolkit.

Do more with Python! This week Packt is celebrating Python with a 50% discount on their leading titles. Take a look at what’s on offer and expand your programming horizons today.

Author Bio:

Greg Roberts is a Data Analyst at Packt Publishing, and has a Masters degree in Theoretical Physics and Applied Maths. Since joining Packt he has developed a passion for working with data, and enjoys learning about new or novel methods of predictive analysis and machine learning. To this end, he spends most of his time fiddling with models in python, and creating ever more complex Neo4j instances, for storing and analysing any and all data he can find. When not writing Python, he is a big fan of reddit, cycling and making interesting noises with a guitar.

You can find Greg on Twitter @GregData

Originally posted on Data Science Central 

Read more…

What Defines a Big Data Scenario?

Guest blog post by Khosrow Hassibi

Big data is a new marketing term that highlights the ever-increasing and exponential growth of data in every aspect of our lives. The term big data originated from within the open-source community, where there was an effort to develop analytics processes that were faster and more scalable than traditional data warehousing, and could extract value from the vast amounts of unstructured and semi-structured data produced daily by web users. Consequently, big data's origins are tied to web data, though today big data is used in a larger context. Read more here...

Infographic Source, "Data Never Sleeps 2.0"; 

Read more…

Environmental Monitoring using Big Data

Guest blog post by Heinrich von Keler

In this post, I will cover in-depth a Big Data use case: monitoring and forecasting air pollution.

A typical Big Data use case in the modern Enterprise includes the collection and storage of sensor data, executing data analytics at scale, generating forecasts, creating visualization portals, and automatically raising alerts in the case of abnormal deviations or threshold breaches.

This article will focus on an implemented use case: monitoring and analyzing air quality sensor data using Axibase Time-Series Database and R Language.

Steps taken by the data science team to execute the use case:

  • Collect historical data from AirNow into ATSD
  • Stream current data from AirNow into ATSD
  • Use R Language to execute data analytics and generate forecasts for all collected entities and metrics
  • Create Holt-Winters forecasts in ATSD for all collected entities and metrics
  • Build a visualization portal
  • Setup alert and notification rules in the ATSD Rule Engine

The Data

Hourly readings of several key air quality metrics are being generated by over 2,000 monitoring sensor stations located in over 300 cities across the United States; both the historical and the streaming data are retrieved and stored in ATSD.

The data is provided by AirNow, which is a U.S. government EPA program that protects public health by providing forecast and real-time air quality information.

The two main collected metrics are PM2.5 and Ozone (o3).

PM2.5 is particles less than 2.5 micrometers in diameter, often called “fine” particles. These particles are so small they can be detected only with an electron microscope. Sources of fine particles include all types of combustion, including motor vehicles, power plants, residential wood burning, forest fires, agricultural burning, and industrial processes.

o3 (Ozone) occurs naturally in the Earth’s upper atmosphere, 6 to 30 miles above the Earth’s surface where it forms a protective layer that shields us from the sun’s harmful ultraviolet rays. Man-made chemicals are known to destroy this beneficial ozone.

Other collected metrics are: pm10 (particulate matter up to 10 micrometers in size), co (Carbon Monoxide), no2 (nitrogen dioxide) and so2 (sulfur dioxide).

Collecting/Streaming the Data

A total of 5 years of historical data has been collected, stored, analyzed and accurately forecast. In order for the forecasts to have maximum accuracy and account for trends and seasonal cycles, at least 3 to 5 years of detailed historical data is recommended.

An issue with the accuracy of the data was immediately identified. The data was becoming available with a fluctuating time delay of 1 to 3 hours. An analysis was conducted by collecting all values for each metric and entity, which resulted in several data points being recorded for the same metric, entity and time. This led us to believe that there was both a time delay and a stabilization period. Below are the results:

Once available, the data then took another 3 to 12 hours to stabilize, meaning that the values were fluctuating during that time frame for most data points.

As a result of this analysis, it was decided that all data would be collected with a 12-hour delay in order to increase the accuracy of the data and forecasts.

Axibase Collector was used to collect the data from monitoring sensor stations and stream into Axibase Time-Series Database.

In Axibase Collector, a job was set up to collect data from the air monitoring sensor stations in Fresno, California. For this particular example, Fresno was selected because it is considered one of the most polluted cities in the United States, with air quality warnings often being issued to the public.

The File Job sets up a cron task that runs at a specified interval to collect the data and batch upload it into ATSD.

The File Forwarding Configuration is a parser configuration for data incoming from an external source. The path to the external data source is specified, a default entity is assigned to the Fresno monitoring sensor station, start time and end time determine the time frame for retrieving new data (end time syntax is used).

Once these two configurations are saved, the collector starts streaming fresh data into ATSD.

The entities and metrics streamed by the collector into ATSD can be viewed from the UI.

The whole data-set currently has over 87,000,000 records for each metric, all stored in ATSD.

Generating Forecasts in R

The next step was to analyze the data and generate accurate forecasts. Built-in Holt-Winters and ARIMA algorithms were used in ATSD, and custom R language forecasting algorithms were used for comparison.

To analyze the data in R, the R language API client was used to retrieve the data and then save the custom forecasts back into ATSD.

Forecasts were built for all metrics for the period from May 11 until June 1.

The steps taken to forecast the pm2.5 metric will be highlighted.

The Rssa package was used to generate the forecast. This package implements the Singular Spectrum Analysis (SSA) method.

Recommendations from the following sources were used to choose parameters for SSA forecasting:

The following steps were executed when building the forecasts:

The pm2.5 series was retrieved from ATSD using the query() function; 72 days of data were loaded.
The SSA decomposition was built with a window of 24 days and 100 eigen triples:

dec <- ssa(values, L = 24 * 24, neig = 100)


The eigenvalues, eigenvectors, pairs of sequential eigenvectors and the w-correlation matrix of the decomposition were graphed:


plot(dec, type = "values")

plot(dec, type = "vectors", idx = 1:20)

plot(dec,type = "paired", idx = 1:20)

plot(wcor(dec), idx = 1:100)

A group of eigen triples was then selected to use when forecasting. The plots suggest several options.

Three different options were tested: 1, 1:23, and 1:35, because groups 1, 2:23 and 24:35 are separated from other eigen vectors, as judged from the w-correlation matrix.

The rforecast() function was used to build the forecast:

rforecast(x = dec, groups = 1:35, len = 21 * 24, base = "original")


Tests were run with vforecast(), and bforecast() using different parameters, but rforecast() was determined to be the best option in this case.

Graph of the original series and three resulting forecasts:

Forecast with eigen triples 1:35 was selected as the most accurate and saved into ATSD.

  • To save forecasts into ATSD the save_series() function was used.

Generating Forecasts in ATSD

The next step was to create a competing forecast in ATSD using the built-in forecasting features. The majority of the settings were left in automatic mode, so the system itself determines the best parameters (based on the historical data) when generating the forecast.

Visualizing the Results

To visualize the data and forecasts, a portal was created using the built-in visualization features.

Thresholds have been set for each metric, in order to alert the user when either the forecast or actual data are reaching unhealthy levels of air pollution.

When comparing the R forecasts and ATSD forecasts to the actual data, the ATSD forecasts turned out to be significantly more accurate in most cases, learning and recognizing the patterns and trends with more certainty. Up to this point in time, as the actual data comes in, it follows the ATSD forecast very closely; any deviations are minimal and fall within the confidence interval.

It is clear that the built-in forecasting of ATSD often produces more accurate results than even one of the most advanced R language forecasting algorithms used as part of this use case. It is absolutely possible to rely on ATSD to forecast air pollution for a few days or weeks into the future.

You can keep track of how these forecasts perform in comparison to the actual data in Chart Lab.

Alerts and Notifications

A smart alert notification was set up in the Rule Engine to notify the user by email if the pollution levels breach the set threshold or deviate from the ATSD forecast.

Analytical rules set in Rule Engine for pm2.5 metric – alerts will be raised if the streaming data satisfies one of the rules:

value > 30 - Raise an alert if the last metric value exceeds the threshold.

forecast_deviation(avg()) > 2 - Raise an alert if the actual values exceed the forecast by more than 2 standard deviations (see image below). Smart rules like this capture extreme spikes in air pollution.
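As a hedged illustration only (this is not how the ATSD Rule Engine evaluates these expressions internally), the combined logic of the two rules can be sketched in a few lines of Python, using a window of recent readings and the matching forecast values:

def should_alert(recent_values, forecast_values, threshold=30.0, max_sigmas=2.0):
    # Rule 1: the latest reading breaches the fixed threshold
    if recent_values[-1] > threshold:
        return True
    # Rule 2: the average of the recent readings deviates from the average forecast
    # by more than max_sigmas standard deviations of the forecast window
    n = len(forecast_values)
    mean_f = sum(forecast_values) / float(n)
    sigma = (sum((x - mean_f) ** 2 for x in forecast_values) / float(n)) ** 0.5
    mean_a = sum(recent_values) / float(len(recent_values))
    return sigma > 0 and abs(mean_a - mean_f) > max_sigmas * sigma

print(should_alert([12, 18, 35], [14, 15, 16]))  # True: the last reading exceeds the threshold of 30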

At this point the use case is fully implemented and will function autonomously; ATSD automatically streams the sensor data, generates a new forecast every 24 hours for 3 weeks into the future and raises alerts if the pollution levels rise above the threshold or if a negative trend is discovered.

Results and Conclusions

The results of this use case are useful for travelers, for whom it is important to have an accurate forecast of environmental and pollution related issues that they may face during their visits or for expats moving to work in a new city or country. Studies have proven that long-term exposure to high levels of pm2.5 can lead to serious health issues.

This research and environmental forecasting is especially valuable in regions like China, where air pollution is seriously affecting the local population and visitors. In cities like Shanghai, Beijing and Guangzhou, pm2.5 levels are constantly fluctuating from unhealthy to critical levels, and yet accurate forecasting is limited. Pm2.5 forecasting is critical for travelers and tourists who need to plan their trips during periods of lower pollution levels due to the potential health risks associated with exposure to this sort of pollution.

Government agencies can also take advantage of pollution monitoring to plan and issue early warnings to travelers and locals, so that precautions can be taken to prevent exposure to unhealthy levels of pm2.5 pollution. Detecting a trend and raising an alert prior to pm2.5 levels breaching the unhealthy threshold is critical for public safety and health. Having good air quality data and performing data analytics can allow people to adapt and make informed decisions.

Big Data Analytics is an empowerment tool that can put valuable information in the hands of corporations, governments and individuals, and that knowledge can help motivate or give people tools to stimulate change. Air pollution is currently affecting the lives of over a billion people across the globe, and with current trends the situation will only get worse. Often the exact source of the air pollution, how it is interacting in the air and how it is dispersing cannot be determined; the lack of such information makes it a difficult problem to tackle. With advances in modern technologies and new Big Data solutions, it is becoming possible to combine sensor data with meteorological satellite data to perform extensive data analytics and forecasting. Through Big Data analytics it will be possible to pinpoint the pollution source and dispersion trends days in advance.

I sincerely believe that Big Data has a large role to play in tackling air pollution and that in the coming years advanced data analytics will be a key tool influencing government decisions and regulation change.

You can learn more about Big Data analytics, forecasting and visualization at Axibase.

Read more…

Time Series Analysis using R-Forecast package

Guest blog post by suresh kumar Gorakala

In today’s blog post, we shall look into time series analysis using the R package forecast. The objective of the post is to explain the different methods available in the forecast package that can be applied while dealing with time series analysis and forecasting.

What is Time Series?

A time series is a collection of observations of well-defined data items obtained through repeated measurements over time. For example, measuring the value of retail sales each month of the year would comprise a time series.

Objective:

  • Identify patterns in the data – stationarity/non-stationarity. 
  • Prediction from previous patterns.

Time series Analysis in R:

My data set contains monthly car sales data from Jan 2008 to Dec 2012.

Problem Statement: Forecast sales for 2013

MyData[1,1:14]

PART | Jan08 | Feb08 | Mar08 | .... | Nov12 | Dec12
MERC |   100 |   127 |    56 | .... |   776 |   557

Table: shows the first row data from Jan 2008 to Dec 2012

Results:

The forecasts of the time series data will be:

Assuming that the data sources for the analysis are finalized and the cleansing of the data is done, the steps below describe the analysis in further detail.

Step1: Understand the data: 

As a first step, understand the data visually. For this purpose, the data is converted to a time series object using ts() and plotted using the plot() function available in R.

ts = ts(t(data[,7:66]))

plot(ts[1,], type='o', col='blue')

The image above shows the monthly sales of an automobile.

Forecast package & methods:

The forecast package is written by Rob J Hyndman and is available from CRAN here. The package contains methods and tools for displaying and analyzing univariate time series forecasts, including exponential smoothing via state space models and automatic ARIMA modelling.

Before going into more accurate forecasting functions for time series, let us do some basic forecasts using the meanf(), naive() and random walk with drift – rwf() – methods. Though these may not give us accurate results, we can use them as benchmarks.

All these forecasting functions return objects which contain the original series, point forecasts, the forecasting method used, and residuals. The code below shows the three methods and their plots.

library(forecast)

mf = meanf(ts[,1],h=12,level=c(90,95),fan=FALSE,lambda=NULL)

plot(mf)

 



mn = naive(ts[,1],h=12,level=c(90,95),fan=FALSE,lambda=NULL) 

plot(mn)

 



md = rwf(ts[,1],h=12,drift=T,level=c(90,95),fan=FALSE,lambda=NULL) 

plot(md) 

Measuring accuracy:

Once the model has been generated, its accuracy can be tested using accuracy(). The accuracy() function returns measures such as MASE which can be used to assess the accuracy of the model. The best model is chosen from the results below as the one with relatively lower values of ME, RMSE, MAE, MPE, MAPE and MASE.

> accuracy(md)
                       ME     RMSE      MAE       MPE     MAPE     MASE
Training set 1.806244e-16 2.445734 1.889687 -41.68388 79.67588 1.197689

> accuracy(mf)
                      ME     RMSE      MAE       MPE     MAPE MASE
Training set 1.55489e-16 1.903214 1.577778 -45.03219 72.00485    1

> accuracy(mn)
                     ME    RMSE      MAE       MPE     MAPE     MASE
Training set 0.1355932 2.44949 1.864407 -36.45951 76.98682 1.181666

 Step2: Time Series Analysis Approach:

A typical time-series analysis involves below steps:

  • Check for underlying patterns – stationarity/non-stationarity, seasonality, trend.
  • After the patterns have been identified, apply transformations to the data if needed, based on the seasonality/trends that appear in the data.
  • Forecast the future values using a proper ARIMA model obtained with the auto.arima() method.

Identify Stationarity/Non-Stationarity:

A stationary time series is one whose properties do not depend on the time at which the series is observed. Time series with trends, or with seasonality, are not stationary.

The stationarity/non-stationarity of the data can be determined by applying unit root tests – the augmented Dickey–Fuller (ADF) test and the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test.

ADF: The null-hypothesis for an ADF test is that the data are non-stationary. So large p-values are indicative of non-stationarity, and small p-values suggest stationarity. Using the usual 5% threshold, differencing is required if the p-value is greater than 0.05.

 library(tseries)

adf = adf.test(ts[,1])

adf

        Augmented Dickey-Fuller Test

data:  ts[, 1]

Dickey-Fuller = -4.8228, Lag order = 3, p-value = 0.01

alternative hypothesis: stationary

The above output suggests that the data is stationary, so we can go ahead with ARIMA models.

 

KPSS: Another popular unit root test is the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test. This reverses the hypotheses, so the null-hypothesis is that the data are stationary. In this case, small p-values (e.g., less than 0.05) suggest that differencing is required.

kpss = kpss.test(ts[,1])

Warning message:

In kpss.test(ts[, 1]) : p-value greater than printed p-value

kpss

        KPSS Test for Level Stationarity

data:  ts[, 1]

KPSS Level = 0.1399, Truncation lag parameter = 1, p-value = 0.1

Differencing:

Based on the unit root test results, we identify whether the data is stationary or not. If the data is stationary, we choose an optimal ARIMA model and forecast the future intervals. If the data is non-stationary, we use differencing – computing the differences between consecutive observations. Use the ndiffs() and diff() functions to find the number of differences needed for the data and to difference the data, respectively.

ndiffs(ts[,1])

[1] 1

diff_data = diff(ts[,1])

Time Series:

Start = 2

End = 60

Frequency = 1

 [1]  1  5 -3 -1 -1  0  3  1  0 -4  4 -5  0  0  1  1  0  1  0  0  2 -5  3 -2  2  1 -3  0  3  0  2 -1 -5  3 -1

[36] -1  2 -1 -1  5 -2  0  2 -2 -4  0  3  1 -1  0  0  0 -2  2 -3  4 -3  2  5

 

Now retest for stationarity by applying the acf()/kpss.test() functions; if the results show stationarity, go ahead and apply ARIMA models.

Identify Seasonality/Trend:

The seasonality in the data can be examined by plotting the output of stl():

stl_fit = stl(ts[,1], s.window = "periodic")

Series is not periodic or has less than two periods

 

Since my data doesn’t contain any seasonal behavior, I will not touch the seasonality part.

ARIMA Models:

For forecasting stationary time series data, we need to choose an optimal ARIMA(p,d,q) model. For this we can use the auto.arima() function, which chooses the optimal (p,d,q) values and returns the fitted model. Know more about ARIMA from here.

auto.arima(ts[,2])

Series: ts[, 2]

ARIMA(3,1,1) with drift        

Coefficients:

          ar1      ar2      ar3      ma1   drift

      -0.2621  -0.1223  -0.2324  -0.7825  0.2806

s.e.   0.2264   0.2234   0.1798   0.2333  0.1316

sigma^2 estimated as 41.64:  log likelihood=-190.85

AIC=393.7   AICc=395.31   BIC=406.16

 

Forecast time series:

 

Now we use the forecast() method to forecast the future values.

forecast(auto.arima(diff_data))

   Point Forecast     Lo 80      Hi 80     Lo 95    Hi 95
61   -3.076531531 -5.889584 -0.2634795 -7.378723 1.225660
62    0.231773625 -2.924279  3.3878266 -4.594993 5.058540
63    0.702386360 -2.453745  3.8585175 -4.124500 5.529272
64   -0.419069906 -3.599551  2.7614107 -5.283195 4.445055
65    0.025888991 -3.160496  3.2122736 -4.847266 4.899044
66    0.098565814 -3.087825  3.2849562 -4.774598 4.971729
67   -0.057038778 -3.243900  3.1298229 -4.930923 4.816846
68    0.002733053 -3.184237  3.1897028 -4.871317 4.876783
69    0.013817766 -3.173152  3.2007878 -4.860232 4.887868
70   -0.007757195 -3.194736  3.1792219 -4.881821 4.866307

plot(forecast(auto.arima(diff_data)))

The below flow chart will give us a summary of the time series ARIMA models approach:

The above flow diagram explains the steps to be followed for time series forecasting.

Please visit my blog dataperspective for more articles

Read more…
