What is Big Data?
Big Data is data that is beyond our capacity to store and beyond our capacity to process with the systems we normally use.
For example, data generated by:
- Social networks such as Facebook, LinkedIn, etc.
- Online shopping, and many others…
Currently, about 90% of all the data that exists today was generated in the last few years, whereas only 10% was generated before that.
Let's flash back and recall the configuration our desktops used to have.
In the 1990s, a desktop typically had a configuration such as:
1 GB to 20 GB hard disk for storage
64 MB to 128 MB RAM
10 kbps data transfer speed
Today, a desktop typically has a configuration such as:
1 TB to 3 TB hard disk for storage
4 GB to 8 GB RAM
50 Mbps data transfer speed
Capacity has grown because we now have far more data to handle. For example, image files are larger because of high-resolution cameras, and movie files are larger because of HD and Blu-ray formats, and the same is true of many other kinds of data. All of this data has to be stored; it cannot simply be discarded.
Over time, data volume has been growing exponentially. Earlier we used to talk about megabytes or gigabytes. Now we talk about data volume in terms of terabytes, petabytes and even zettabytes! Global data volume was around 1.8 ZB in 2011 and is expected to reach 7.9 ZB in 2015. It is also said that the amount of global information doubles every two years!
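As a quick sanity check on the figures above, the short Python sketch below works out the doubling time implied by going from 1.8 ZB in 2011 to 7.9 ZB in 2015. It assumes smooth exponential growth between the two points, which is a simplification; the function name `doubling_time` is chosen here only for illustration.

```python
import math

def doubling_time(start_volume, end_volume, years):
    """Years needed for the volume to double, assuming steady exponential growth."""
    growth_factor = end_volume / start_volume
    # Number of times the volume doubled during the period.
    doublings = math.log2(growth_factor)
    return years / doublings

# Figures quoted in the text, in zettabytes.
implied = doubling_time(1.8, 7.9, 2015 - 2011)
print(f"Implied doubling time: {implied:.2f} years")
```

Under that assumption the result comes out a little under two years, which is consistent with the "doubles every two years" rule of thumb.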
So Big Data is simply a collection of data so large and complex that it becomes very difficult to capture, store, process, retrieve and analyse with the database management tools at hand or with traditional data processing techniques.
According to IBM, there are three characteristics of Big Data:
- Volume: the sheer amount of data, measured in GB, TB, PB and beyond. For example, Facebook generates 500+ terabytes of data per day.
- Velocity: the speed at which data arrives and must be processed. For example, analyzing 2 million records each day to identify the reason for losses.
- Variety: the different forms data takes, including unstructured and semi-structured data (images, audio, video, sensor data, log files, etc.).
Now let's understand what structured and unstructured data are.
Structured data is data that is easily identifiable because it is organized in a fixed structure. The most common form of structured data is a database, where specific information is stored in tables, that is, in rows and columns. Unstructured data is data that has no such predefined organization and therefore cannot be identified easily. It could be in the form of images, videos, documents, email, logs and random text. It is not in the form of rows and columns.
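To make the difference concrete, here is a small Python sketch. The order table, its column names and the log line are all made up for illustration. With structured data we can ask for a column by name; with unstructured text we first have to impose a structure ourselves, here with a regular expression.

```python
import csv
import io
import re

# Structured data: every record has the same named columns.
# (Hypothetical table; the column names are assumptions for this example.)
structured_csv = """order_id,customer,amount
1001,alice,250.00
1002,bob,99.50
"""
rows = list(csv.DictReader(io.StringIO(structured_csv)))
# Because the structure is known, a column can be read directly by name.
total = sum(float(row["amount"]) for row in rows)

# Unstructured data: free text with no fixed fields (made-up log line).
log_line = "2015-03-01 10:15:42 user bob could not complete payment of 99.50, card declined"

# No columns exist, so the fields have to be extracted by pattern matching.
match = re.search(r"user (\w+) .* payment of ([\d.]+)", log_line)
user, amount = match.group(1), float(match.group(2))

print(total, user, amount)
```

The regular expression only works for log lines written in exactly this shape, which is the practical difficulty with unstructured data: each new source needs its own extraction logic.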
Here comes the need for Hadoop
Every day a large amount of unstructured data is dumped into our machines. The biggest challenge is not storing large data sets in our systems, but retrieving and analyzing that data across the organization, when it sits on different machines at different locations, and doing so with good performance. This is where the need for Hadoop arises. Hadoop can analyze data spread across different machines at different locations quickly and cost-effectively.
Characteristics of the Hadoop framework
The Hadoop framework is written in Java. It is designed to solve problems that involve analyzing large volumes of data (e.g. petabytes). The programming model is based on Google’s MapReduce. The storage infrastructure is modeled on the Google File System (GFS), Google’s distributed file system, and is implemented in Hadoop as the Hadoop Distributed File System (HDFS). Hadoop handles large files with high data throughput and supports data-intensive distributed applications. Hadoop is scalable, as more nodes can easily be added to a cluster.
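The MapReduce programming model mentioned above can be sketched in a few lines of plain Python. This is not Hadoop's API and it runs on a single machine; it only imitates the three conceptual steps (map, shuffle, reduce) with the classic word-count example. The function names are chosen here for illustration.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_phase(pairs):
    """Shuffle: group together all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    # In a real cluster, different lines (or blocks) could be mapped on different nodes.
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

sample = ["big data needs hadoop", "hadoop processes big data"]
counts = word_count(sample)
print(counts)
```

Because each call to `map_phase` depends only on its own line, and each call to `reduce_phase` depends only on its own key, both steps can be spread over many machines. That independence is what lets the model scale as nodes are added.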