1. What do we mean by Hadoop and Hadoop eco-system?
2. Will Hadoop solve my problems?
3. Which Hadoop distribution will fit our requirements best?
In this blog post, I have tried to answer the above questions. But first, if you don’t know what Big Data is, how do you qualify whether your large data aggregation is Big Data or not? The blog post Big Data Explained by Varoon Rajani might help you understand that.
What do we mean by Hadoop and Hadoop eco-system?
Apache Hadoop is an open source framework that allows for distributed processing of large data sets across computing clusters and is the most widely used technology for Big Data processing.
The Hadoop framework has evolved into a set of tools and technologies to efficiently process, store, and analyze huge amounts of varied data in a linearly scalable and reliable fashion.
[See How AWS Cloud Makes Hadoop a Piece of Cake with Elastic MapReduce]
Apache Hadoop has two major projects:
- MapReduce: A programming framework for the parallel processing of large data sets across a cluster (a minimal word-count example follows this list)
- Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to large data sets
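To make the MapReduce model concrete, here is a minimal sketch of the classic word-count job written against Hadoop’s Java MapReduce API. The input and output HDFS paths are hypothetical placeholders supplied on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each mapper
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // HDFS paths passed on the command line, e.g. /user/demo/in /user/demo/out
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The framework handles splitting the input across the cluster, shuffling intermediate (word, count) pairs to reducers, and retrying failed tasks, which is what makes the model attractive for large data sets.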
The Hadoop ecosystem is rapidly evolving, with a large number of community contributors. The following diagram gives an overview of the ecosystem.
[Diagram: Hadoop Distribution Ecosystem]
Some of the ecosystem components are explained below:
- Hive: A data warehouse infrastructure with SQL-like querying capabilities on Hadoop data sets (see the query sketch after this list)
- Pig: A high level data flow language and execution framework for parallel computation
- ZooKeeper: A high-performance coordination service for distributed applications
- Mahout: A scalable machine learning and data mining library
- HBase: A scalable, distributed database that supports structured data storage for large tables
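As an illustration of Hive’s SQL-like interface, here is a minimal sketch that runs a HiveQL query from Java over JDBC. The HiveServer2 endpoint, the page_views table, and its columns are illustrative assumptions, not part of any real deployment.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
  public static void main(String[] args) throws Exception {
    // Assumes the Hive JDBC driver is on the classpath and a
    // HiveServer2 instance is running at this (hypothetical) address.
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://localhost:10000/default", "", "");
         Statement stmt = conn.createStatement()) {
      // HiveQL looks like SQL, but Hive compiles it into jobs on the cluster.
      ResultSet rs = stmt.executeQuery(
          "SELECT page, COUNT(*) AS hits "
              + "FROM page_views GROUP BY page ORDER BY hits DESC LIMIT 10");
      while (rs.next()) {
        System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
      }
    }
  }
}
```

The point of tools like Hive and Pig is exactly this: analysts can express cluster-scale computations without writing MapReduce code by hand.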
Will Hadoop solve my problems?
It is important to understand that Hadoop is not a complete replacement for traditional enterprise data warehousing and business intelligence tools; rather, it is a complementary approach that solves some of their challenges.
Hadoop is best suited for:
- Processing unstructured data
- Complex parallel information processing
- Large Data Sets/Files
- Machine Learning
- Fault-tolerant processing of critical data
- Reports that are not needed in real time
- Queries that cannot easily be expressed in SQL
- Data processing jobs that need to run faster through parallelism
Some of the use cases for different industries are:
- Social Media Engagement and Clickstream Analysis (Web Industry): A clickstream is the recording of the parts of the screen a user clicks on while web browsing or using another software application. Clickstream analysis is useful for web activity analysis, customer behaviour analysis, software testing, market research, and even for analyzing employee productivity.
- Content Optimization and Engagement (Media Industry): Content needs to be optimized for rendering on different devices that support different content formats. Media companies require large amounts of content to be processed into different formats, and content engagement models need to be mapped for feedback and enhancements.
- Network Analytics and Mediation (Telecommunication Industry): Telecommunication companies generate a large amount of data in the form of usage transaction data, network performance data, cell-site information, device-level data, and other forms of back office data. Real-time analytics plays a critical role in reducing OPEX and enhancing the user experience.
- Targeting and Product Recommendation (Retail Industry): Retail and e-commerce companies model data from different sources to target customers and provide product recommendations based on each end user’s profile and usage patterns (see the sketch after this list).
- Risk Analysis, Fraud Monitoring and Capital Market Analysis (BFSI Industry): The banking and finance sectors have large sets of structured and unstructured data generated by different sources, such as trading patterns in capital markets and consumer behaviour for banking services. Financial institutions use big data to perform risk analysis, fraud monitoring and tracking, capital market analysis, converged data management, etc.
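For the product recommendation use case, a minimal sketch using Apache Mahout’s Taste collaborative-filtering API (Mahout appears in the ecosystem list above) might look like the following. The ratings.csv file, its userID,itemID,preference line format, the neighborhood size, and the user ID are all illustrative assumptions.

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class ProductRecommender {
  public static void main(String[] args) throws Exception {
    // ratings.csv is a hypothetical file of lines: userID,itemID,preference
    DataModel model = new FileDataModel(new File("ratings.csv"));

    // Compare users by how similarly they rate the same items.
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);

    // Consider the 10 most similar users when scoring candidate items.
    UserNeighborhood neighborhood =
        new NearestNUserNeighborhood(10, similarity, model);

    Recommender recommender =
        new GenericUserBasedRecommender(model, neighborhood, similarity);

    // Top 3 product recommendations for (hypothetical) user 42.
    List<RecommendedItem> recommendations = recommender.recommend(42L, 3);
    for (RecommendedItem item : recommendations) {
      System.out.println(item.getItemID() + " scored " + item.getValue());
    }
  }
}
```

This is the user-based collaborative-filtering approach; in production, retailers typically retrain such models over much larger rating sets stored in HDFS.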
The list is long and specific to individual requirements; the good news is that enterprises can profit from both structured and unstructured data.
[What are those 5 Major Points to Consider on Your Enterprise Roadmap to Cloud?]
Which Hadoop distribution will fit our requirements best?
There are many open-source and paid distributions available for implementing Hadoop, apart from Apache’s own open-source distribution. Each Hadoop deployment implements some or all of the tools listed above, depending on the project requirements.
Comparison of three major Hadoop Distributions
The matrix below compares three major Hadoop distributions, (1) the Amazon Hadoop distribution, (2) the MapR Hadoop distribution, and (3) the Cloudera Hadoop distribution, on four broad parameters:
- Technical details – This covers base Hadoop version, file system support, job scheduling support, etc.
- Ease of Deployment – Availability of toolkits to manage deployment
- Ease of Maintenance – Cluster management and tools for orchestration.
- Cost – The cost of implementing a particular Hadoop distribution, its billing model, and licenses.
[Matrix: Big Data – Hadoop Distributions Comparison, BlazeClan Analysis]
Based on this analysis, mapping your enterprise requirements against the above matrix should make it easy to decide which Hadoop distribution is best suited to your use case. Please note that the matrix compares only three major Hadoop distributions; there are many others in the market.
BlazeClan helps enterprises profit from large data sets by implementing the right Hadoop distribution.