Guest blog post by Diya Soubra
As I studied the subject, the following three terms stood out in relation to Big Data.
Variety, Velocity and Volume.
In marketing, the 4Ps define all of marketing using only four terms.
Product, Promotion, Place, and Price.
I claim that the 3Vs above totally define big data in a similar fashion.
These three properties define the expansion of a data set along various fronts to where it merits to be called big data. An expansion that is accelerating to generate yet more data of various types.
The plot above, using three axes helps to visualize the concept.
The size of available data has been growing at an increasing rate. This applies to companies and to individuals. A text file is a few kilo bytes, a sound file is a few mega bytes while a full length movie is a few giga bytes.
More sources of data are added on continuous basis. For companies, in the old days, all data was generated internally by employees. Currently, the data is generated by employees, partners and customers. For a group of companies, the data is also generated by machines. For example, Hundreds of millions of smart phones send a variety of information to the network infrastructure. This data did not exist five years ago.
More sources of data with a larger size of data combine to increase the volume of data that has to be analyzed. This is a major issue for those looking to put that data to use instead of letting it just disappear.
Peta byte data sets are common these days and Exa byte is not far away.
Large Synoptic Survey Telescope (LSST).
“Over 30 thousand gigabytes (30TB) of images will be generated every night during the decade -long LSST sky survey. ”
72 hours of video are uploaded to YouTube every minute
There is a corollary to Parkinson’s law that states: “Data expands to fill the space available for storage.”
This is no longer true since the data being generated will soon exceed all available storage space.
Initially, companies analyzed data using a batch process. One takes a chunk of data, submits a job to the server and waits for delivery of the result. That scheme works when the incoming data rate is slower than the batch processing rate and when the result is useful despite the delay. With the new sources of data such as social and mobile applications, the batch process breaks down. The data is now streaming into the server in real time, in a continuous fashion and the result is only useful if the delay is very short.
140 million tweets per day on average.( more in 2012)
I have not yet determined how data velocity may continue to increase since real time is as fast as it gets. The delay for the results and analysis will continue to shrink to also reach real time.
From excel tables and databases, data structure has changed to loose its structure and to add hundreds of formats. Pure text, photo, audio, video, web, GPS data, sensor data, relational data bases, documents, SMS, pdf, flash, etc etc etc. One no longer has control over the input data format. Structure can no longer be imposed like in the past in order to keep control over the analysis. As new applications are introduced new data formats come to life.
Google uses smart phones as sensors to determine traffic conditions.
In this application they are most likely reading the speed and position of millions of cars to construct the traffic pattern in order to select the best routes for those asking for driving directions. This sort of data did not exist on a collective scale a few years ago.
The 3Vs together describe a set of data and a set of analysis conditions that clearly define the concept of big data.
So what is one to do about this?
So far, I have seen two approaches.
1-divide and concur using Hadoop
2-brute force using an “appliance” such as the SAP HANA
(High- Performance Analytic Appliance)
In the divide and concur approach, the huge data set is broken down into smaller parts (HDFS) and processed (Mapreduce) in a parallel fashion using thousands of servers.
As the volume of the data increases, more servers are added and the process runs in the same manner. Need a shorter delay for the result, add more servers again. Given that with the cloud, server power is infinite, it is really just a matter of cost. How much is it worth to get the result in a shorter time.
One has to accept that not ALL data analysis can be done with Hadoop. Other tools are always required.
For the brute force approach, a very powerful server with terabytes of memory is used to crunch the data as one unit. The data set is compressed in memory. For example, for a Twitter data flow that is pure text, the compression ratio may reach 100:1. A 1TB IBM SAP HANA can then load a data set of 100TB in memory and do analytics on it.
IBM has a 100TB unit for demonstration purposes.
Many other companies are filling in the gap between these two approaches by releasing all sorts of applications that address different steps of the data processing sequence plus the management and the system configuration.