The buzzword “Big Data” has caught everyone’s fancy these days. Visit any IT forum or job portal, attend any conference or read any article, and “Big Data” is everywhere.
So what exactly is Big Data?
Simply put, Big Data is data that cannot be processed by conventional tools and technologies. Big Data is too Big, too Fast and too Varied.
Volume. Velocity. Variety.
It would be wrong to say that Big Data is a new concept. Big Data has existed for a long time; only the power to process it was limited to large organizations with enough money to spend on it.
Companies like Google have always had access to Big Data and the resources to process and analyze it.
With lower hardware costs, the evolution of cloud technologies and the help of open source frameworks and software, the resources to process Big Data are now available to independent small businesses, startups and even individuals. These changes have led to various new developments and innovations in Big Data technology and solutions in the recent past.
How is (Big) Data useful to organizations?
Organizations, whether big or small, have always depended on the feedback data they receive from their customers to continuously improve and sell their products. Sampling surveys, feedback forms and marketing campaigns have all been a part of this feedback loop.
With the advent of Web 2.0, these organizations have shifted the feedback gathering process online. With advances in Information Technology and the reduction in hardware costs, people are increasingly and constantly connected to each other using technology, each generating plenty of data.
Another major source of data is machines. With computer chips becoming ubiquitous, data is collected everywhere.
These mountains of data can hold a lot of hidden information which was previously very expensive to reveal. If this data is processed swiftly, it can give enormous insight into consumer behavior. Organizations can use the data for analysis, helping them improve their existing products and develop new ones.
What are the challenges?
Big Data is everywhere and follows no common pattern. It could be streams of data from various social networks, weather data, traffic information, machine logs, online entertainment, financial transactions, census data and a lot more.
All this is Big Data, which can be characterized by three attributes: Volume, Variety and Velocity.
These three attributes together define “Big Data”.
Let us look at each of the three attributes in turn.
In the year 2000, a total of 800,000 PB of data was stored globally.
Today a total of 2.5 quintillion bytes (2.5 Exabytes) is generated every day! And in all probability, by the time you read this blog, the number will have increased by a few Exabytes.
For all the mortals having Byte Problems, here is what it means.
· 1 Bit = 1 Binary Digit
· 8 Bits = 1 Byte
· 1000 Bytes = 1 Kilobyte
· 1000 Kilobytes = 1 Megabyte
· 1000 Megabytes = 1 Gigabyte
· 1000 Gigabytes = 1 Terabyte
· 1000 Terabytes = 1 Petabyte
· 1000 Petabytes = 1 Exabyte
· 1000 Exabytes = 1 Zettabyte
· 1000 Zettabytes = 1 Yottabyte
· 1000 Yottabytes = 1 Brontobyte (informal name, not an official SI unit)
· 1000 Brontobytes = 1 Geopbyte (informal name, not an official SI unit)
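To make the ladder of units concrete, here is a minimal Python sketch that converts a raw byte count into the largest fitting unit, using the same steps of 1000 as the list. The names `UNITS`, `to_bytes` and `humanize` are made up for this illustration, and the list stops at Yottabytes.

```python
# Decimal (power-of-1000) units, in the same order as the list above.
UNITS = ["Bytes", "Kilobytes", "Megabytes", "Gigabytes", "Terabytes",
         "Petabytes", "Exabytes", "Zettabytes", "Yottabytes"]


def to_bytes(value, unit):
    """Convert a value in the given unit to a plain byte count."""
    return value * 1000 ** UNITS.index(unit)


def humanize(num_bytes):
    """Return (value, unit) for the largest unit that keeps value >= 1."""
    index = 0
    while num_bytes >= 1000 ** (index + 1) and index < len(UNITS) - 1:
        index += 1
    return num_bytes / 1000 ** index, UNITS[index]


# The two figures quoted earlier in this post, expressed in Exabytes.
stored_in_2000 = humanize(to_bytes(800_000, "Petabytes"))
generated_per_day = humanize(2_500_000_000_000_000_000)  # 2.5 quintillion bytes
print(stored_in_2000, generated_per_day)
```

Seen this way, the total stored in the year 2000 works out to 800 Exabytes, and the daily figure to 2.5 Exabytes.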
This volume is something which the conventional IT systems of today simply cannot handle.
Conventional database technologies process data in batches, and it could take days if not weeks to process one batch of Big Data.
Today’s Big Data technologies have evolved, and data can now be processed on massively parallel architectures. The MapReduce framework introduced by Google played a major role in the evolution of the parallel processing architecture which forms the base of Big Data processing. We will talk about the Big Data technologies in a later post.
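To give a feel for the MapReduce idea, here is a toy word count in Python. It is only a sketch running in a single process: the function names are my own, and a real framework would spread the map and reduce steps across many machines, which this example does not attempt.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Each document could be mapped independently, which is what makes
    # the map step easy to run in parallel.
    pairs = (pair for doc in documents for pair in map_phase(doc))
    groups = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in groups.items())


counts = word_count(["big data is big", "data is everywhere"])
print(counts)
```

The point to notice is that no map call depends on any other, and each key can be reduced on its own, so both steps can be split across workers.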
Big Data is generated by social networks, sensors installed at store entrances, traffic lights and airplanes, car GPS units and countless other sources. All of this data arrives in varied formats, sometimes not digestible by the existing systems in their current form.
These varied formats differ from the way current systems store data, which is in a well-defined schema in a relational database, where all the data fits in nicely and is easy to understand and analyze. Systems with such static schemas cannot handle variety.
The success of an organization depends on analyzing data in a variety of formats and making business sense out of it. Big Data processing helps organizations take this unstructured data and extract meaningful information, which can then be consumed by humans or stored in structured databases.
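As a small illustration of what handling variety can look like, the sketch below takes the same kind of feedback arriving as JSON, as a CSV line and as a log line, and maps all three onto one common record shape. The field names, the log layout and the sample inputs are all assumptions invented for this example.

```python
import csv
import io
import json


def from_json(text):
    """Parse a JSON object with (assumed) 'user' and 'text' fields."""
    record = json.loads(text)
    return {"source": "json", "user": record["user"], "message": record["text"]}


def from_csv(text):
    """Parse one CSV line with (assumed) columns: user, message."""
    row = next(csv.reader(io.StringIO(text)))
    return {"source": "csv", "user": row[0], "message": row[1]}


def from_log(text):
    """Parse a log line in the (assumed) form '<timestamp> <user> <message>'."""
    _timestamp, user, message = text.split(" ", 2)
    return {"source": "log", "user": user, "message": message}


PARSERS = {"json": from_json, "csv": from_csv, "log": from_log}


def normalize(kind, text):
    """Turn one raw input of a known kind into the common record shape."""
    return PARSERS[kind](text)


records = [
    normalize("json", '{"user": "asha", "text": "great product"}'),
    normalize("csv", 'ravi,"slow delivery, but ok"'),
    normalize("log", "2013-01-01T10:00:00 sensor42 door opened"),
]
print(records)
```

Once every source ends up in the same shape, the downstream analysis no longer has to care where a record came from.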
Just as the Volume and Variety of the data generated have changed, so has the Velocity with which this data is generated. The pace at which data is generated today makes it virtually impossible for conventional systems to handle it.
Twitter generates around 5 gigabytes of data per minute, or about 7 terabytes a day; Facebook generates 7 gigabytes of data per minute, or about 10 terabytes daily. There are numerous other organizations which generate data at equally fast rates.
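The per-minute and per-day figures are the same rates expressed in two ways, which a few lines of Python can confirm. The function name is made up for this sketch, and it assumes 1000 gigabytes to the terabyte, as in the unit list earlier in the post.

```python
MINUTES_PER_DAY = 24 * 60  # 1440 minutes in a day


def gb_per_minute_to_tb_per_day(gb_per_minute):
    """Scale a per-minute rate in gigabytes to a per-day rate in terabytes."""
    return gb_per_minute * MINUTES_PER_DAY / 1000


twitter_tb_per_day = gb_per_minute_to_tb_per_day(5)
facebook_tb_per_day = gb_per_minute_to_tb_per_day(7)
print(twitter_tb_per_day, facebook_tb_per_day)
```

So 5 gigabytes a minute comes to 7.2 terabytes a day, and 7 gigabytes a minute to 10.08 terabytes a day, which matches the rounded daily figures quoted.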
This means that data is constantly flowing and new information is received every second. Current technologies may allow this streaming data to be stored, but the challenge lies in analyzing the data while it is still in flow and making business sense out of it.
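One simple way to analyze data while it is still in flow is to keep a running answer over a sliding time window instead of storing everything first. The sketch below counts events seen in the last few seconds. The class name is my own, and it assumes timestamps are plain numbers in seconds that arrive in non-decreasing order.

```python
from collections import deque


class SlidingWindowCounter:
    """Count events seen within the last `window_seconds` seconds."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.timestamps = deque()

    def add(self, timestamp):
        """Record one event and return the count inside the current window."""
        self.timestamps.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Drop events that have fallen out of the window (assumes ordered input).
        while self.timestamps and self.timestamps[0] <= cutoff:
            self.timestamps.popleft()
        return len(self.timestamps)


counter = SlidingWindowCounter(window_seconds=10)
counts_over_time = [counter.add(t) for t in [0, 3, 9, 12, 25]]
print(counts_over_time)
```

Each new event updates the answer immediately, and old events are forgotten as soon as they leave the window, so memory use stays bounded by the event rate rather than by the length of the stream.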
Organizations looking for a competitive advantage therefore want this analysis done in seconds or even microseconds, faster than their competitors. For example, consider financial markets, where a fraction of a second can help organizations make enormous profits.
This Need for Speed has led to the development of various Big Data streaming technologies, and of fast retrieval technologies such as key-value stores and columnar databases for static data.
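The difference between the two storage styles can be shown with plain Python data structures. This is only a conceptual sketch with made-up sample data, not how any particular product is implemented: a key-value layout makes fetching one whole record by its key cheap, while a columnar layout makes scanning a single field across all records cheap.

```python
# Hypothetical sample records, invented for this illustration.
rows = [
    {"id": 1, "user": "asha", "amount": 120},
    {"id": 2, "user": "ravi", "amount": 80},
    {"id": 3, "user": "meera", "amount": 200},
]

# Key-value layout: one lookup by key returns the whole record.
kv_store = {row["id"]: row for row in rows}


def to_columns(records):
    """Columnar layout: one list per field, values kept in record order."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


columns = to_columns(rows)

record_two = kv_store[2]                # fast point lookup
total_amount = sum(columns["amount"])   # fast scan of a single field
print(record_two, total_amount)
```

Which layout suits you depends on the question being asked: "show me customer 2" favors the key-value style, while "what is the total across all customers" favors the columnar one.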
In this post I have tried to give you an overview of Big Data. Big Data is Big, and there is a lot more to share and learn. Big Data technologies, just like the data, are evolving fast, and there are enormous opportunities in them for everyone.
In the next post we will have an overview of the various technologies and frameworks for Big Data.