Big Data is a collection of tools, techniques and technologies that allow you to work productively and rapidly with data of any volume and variety. Big Data management can be divided into four phases:
Big Data Management Process
Data Generation
The cost of data generation is falling rapidly, which has resulted in vast amounts of data being generated. Data is generated from multiple sources and in many different formats.
While the data generation step happens at different sources, you need your own resources to collect, store, analyze and share the data. Traditional hardware and software technologies are not capable of handling such high volumes of data in so many formats.
To process Big Data, you require software that is designed for distribution, offers an easy programming model and is independent of the underlying platform. One example of such a software ecosystem is Hadoop.
Similarly, you also require a vast amount of hardware infrastructure to process Big Data. The hardware has to be scalable and distributed. The advent of Cloud Computing has made massively scalable infrastructure available at much lower cost than traditional hardware.
The key characteristics of Cloud Computing, such as elasticity, scalability, pay-per-use pricing and no CapEx, make it a perfect match for handling Big Data.
The Amazon Web Services (AWS) Cloud Computing platform offers multiple services that can help with data collection, storage, processing and sharing. Let us look at some of the services that support each of these phases of Big Data management.
Data Collection & Storage
AWS Import/Export: AWS Import/Export speeds up movement of large amounts of data into and out of AWS using portable storage devices for transport. Using Import/Export, AWS transfers data directly onto and off of storage devices using Amazon’s high-speed internal network, bypassing the Internet. For very large data sets, AWS Import/Export is often faster than Internet transfer and more cost effective.
Import/Export is one of the ways to collect vast amounts of data and put it onto AWS infrastructure for further processing.
Amazon Simple Storage Service (S3): Amazon S3 is storage for the Internet. It is designed to make web-scale computing easy and reliable.
Amazon S3 provides a simple web services interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the web.
Amazon S3 is an ideal way to store large amounts of data for analysis because of its reliability and cost effectiveness.
Apache Hadoop can use S3 as its file system, since S3 satisfies Hadoop's file system requirements. As a result, Hadoop can run MapReduce jobs on EC2 servers, reading input data from S3 and writing results back to it.
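As a rough sketch of how data can be pushed to and pulled from S3 programmatically, here is an example using the boto3 Python SDK (the bucket and object names below are placeholders, not part of any real setup):

```python
import boto3

# Minimal sketch: store and retrieve a raw data file in S3.
s3 = boto3.client("s3")

# Upload a local log file so it can be analyzed later.
s3.upload_file("clickstream-2013-01-01.log", "my-bigdata-bucket",
               "raw/clickstream/2013-01-01.log")

# Retrieve the same object from anywhere with network access and credentials.
obj = s3.get_object(Bucket="my-bigdata-bucket",
                    Key="raw/clickstream/2013-01-01.log")
print(obj["Body"].read()[:100])
```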
Amazon Glacier: Amazon Glacier is an extremely low-cost storage service that provides secure and durable storage for data archiving and backup.
Any data that has already been processed, or that no longer needs to be accessed frequently, can be archived using Amazon Glacier.
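A minimal sketch of archiving a processed output file with the boto3 SDK, assuming a Glacier vault has already been created (the vault name and file name are placeholders):

```python
import boto3

# Minimal sketch: push an already-processed archive into a Glacier vault.
glacier = boto3.client("glacier")

with open("results-2013-q1.tar.gz", "rb") as archive:
    response = glacier.upload_archive(
        vaultName="processed-archive",        # placeholder vault name
        archiveDescription="Processed Q1 results",
        body=archive,
    )

# The returned archive ID is needed later to initiate a retrieval job.
print(response["archiveId"])
```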
AWS Storage Gateway: AWS Storage Gateway is a service that connects your on-premises software appliance with Amazon S3 to provide seamless and secure integration between on-premises storage and S3. The service allows you to securely store data in the AWS cloud for scalable and cost-effective storage.
With AWS Storage Gateway, it is now possible to move data generated on premises to the AWS Cloud for storage and processing in an automated and reliable manner.
Amazon Relational Database Service (RDS): Amazon RDS is a managed service that makes it easy to set up, operate and scale a relational database on AWS infrastructure. Amazon RDS currently supports the MySQL, Oracle and Microsoft SQL Server database engines.
If you require a relational database to store large amounts of data, you can use Amazon RDS.
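As an illustration, here is a rough sketch of provisioning a MySQL instance with the boto3 SDK; the instance identifier, class, credentials and storage size are all placeholder assumptions:

```python
import boto3

# Minimal sketch: provision a managed MySQL database instance.
rds = boto3.client("rds")

rds.create_db_instance(
    DBInstanceIdentifier="analytics-db",   # placeholder name
    DBInstanceClass="db.m5.large",         # placeholder instance class
    Engine="mysql",
    MasterUsername="admin",
    MasterUserPassword="change-me-please", # placeholder credential
    AllocatedStorage=100,                  # GB
)

# Once the instance reaches the "available" state, connect to it
# with any standard MySQL client, just like a self-hosted database.
```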
Amazon DynamoDB: DynamoDB is a fully managed NoSQL database service from AWS. It is fast, highly reliable and cost-effective, and is designed to deliver consistently fast performance for internet-scale applications at any scale.
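A minimal sketch of writing and reading an item with the boto3 SDK, assuming a hypothetical table named user_events with a single string hash key user_id:

```python
import boto3

# Minimal sketch: write and read a single item in a DynamoDB table.
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("user_events")  # placeholder table name

# Write one item keyed by user_id.
table.put_item(Item={"user_id": "u-1001", "last_event": "checkout", "count": 42})

# Read it back by primary key.
response = table.get_item(Key={"user_id": "u-1001"})
print(response.get("Item"))
```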
Amazon Redshift: Redshift is a fully managed, petabyte-scale data warehouse service from AWS. Redshift is designed for analytic workloads and connects to standard SQL-based clients and business intelligence tools. Redshift delivers fast query and I/O performance for virtually any size of dataset by using columnar storage technology and by parallelizing and distributing queries across multiple nodes.
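Because Redshift speaks the PostgreSQL wire protocol, any standard PostgreSQL driver can query it. Below is a rough sketch using psycopg2; the cluster endpoint, credentials and table name are placeholders:

```python
import psycopg2

# Minimal sketch: run an analytic query against a Redshift cluster
# using an ordinary PostgreSQL client library.
conn = psycopg2.connect(
    host="analytics-cluster.example.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="admin",
    password="change-me-please",  # placeholder credential
)

with conn.cursor() as cur:
    cur.execute(
        "SELECT page, COUNT(*) FROM clickstream "
        "GROUP BY page ORDER BY 2 DESC LIMIT 10;"
    )
    for row in cur.fetchall():
        print(row)

conn.close()
```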
Data Analytics & Computation
Amazon EMR: Amazon EMR is a managed Hadoop distribution from Amazon Web Services. Amazon EMR helps users analyze and process large amounts of data by distributing the computation across multiple nodes in a cluster on the AWS Cloud.
Amazon EMR uses a customized Apache Hadoop framework to achieve large-scale distributed processing of data. The Hadoop framework uses a distributed data processing model known as MapReduce.
Open source projects that work with Apache Hadoop also work seamlessly with Amazon EMR. In addition, Amazon EMR is well integrated with various AWS services such as EC2 (used to launch master and slave nodes), S3 (used as an alternative to HDFS), CloudWatch (used to monitor EMR jobs), Amazon RDS and DynamoDB.
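As a rough illustration, the sketch below launches a small EMR cluster with the boto3 SDK; the cluster name, log bucket, instance types, release label and IAM role names are assumptions for the example, not prescriptions:

```python
import boto3

# Minimal sketch: launch a small, transient EMR cluster running Hadoop and Hive.
emr = boto3.client("emr")

response = emr.run_job_flow(
    Name="clickstream-analysis",                    # placeholder cluster name
    LogUri="s3://my-bigdata-bucket/emr-logs/",      # placeholder log location
    ReleaseLabel="emr-5.36.0",
    Applications=[{"Name": "Hadoop"}, {"Name": "Hive"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,       # shut down when steps finish
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)

print(response["JobFlowId"])
```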
Amazon EC2: EC2 provides resizable computing capacity in the Amazon Web Services (AWS) cloud. EC2 allows scalable deployment of applications by providing a web service through which a user can boot an Amazon Machine Image to create a virtual machine, which Amazon calls an “instance”, running any software you need.
EC2 can be used to launch as many or as few virtual servers as you need to analyze your data. Amazon EC2 enables you to scale up or down to handle changes in requirements or spikes in popularity, reducing your need to forecast traffic.
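A minimal sketch of launching a couple of worker instances with the boto3 SDK (the AMI ID, instance type and key pair name are placeholders):

```python
import boto3

# Minimal sketch: start two worker instances for a data processing job.
ec2 = boto3.client("ec2")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder AMI
    InstanceType="m5.large",
    MinCount=2,
    MaxCount=2,
    KeyName="analytics-keypair",       # placeholder key pair
)

for instance in response["Instances"]:
    print(instance["InstanceId"])

# Terminate the instances when the analysis is done to stop paying for them.
```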
Data Collaboration & Sharing
Once you have analyzed and processed your data, you need to share it with various teams and stakeholders to get the most out of it.
Collaboration and sharing can happen in multiple ways: for example, generating reports with a BI tool, exposing the data through an application, or storing it in flat files for other processes to pick up and consume.
For collaboration and sharing you can use AWS services such as S3, EC2, RDS, Redshift and DynamoDB, among others, to ensure that the data is available to end users and consumers in the format they require.
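For example, a processed report sitting in S3 can be shared with a stakeholder through a time-limited presigned URL; the sketch below uses the boto3 SDK with placeholder bucket and key names:

```python
import boto3

# Minimal sketch: share a processed report from S3 via a presigned URL.
s3 = boto3.client("s3")

url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-bigdata-bucket", "Key": "reports/q1-summary.csv"},
    ExpiresIn=3600,  # link is valid for one hour
)

print(url)  # hand this link to the consumer of the report
```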
AWS Data Pipeline: The large amounts of data generated by various sources need to be moved and processed using various tools and services. Managing the migration and processing of large amounts of data on a recurring basis is a tedious activity that requires a high level of automation and continuous monitoring.
The AWS Data Pipeline web service gives you an easy, automated way to move data from multiple sources, both within and outside AWS, and to transform it. Data Pipeline is a highly scalable and fully managed service.
With AWS Data Pipeline, it is fast and easy to provision pipelines to move and transform data, which saves development effort and maintenance overhead.
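As a rough sketch, the snippet below creates a pipeline shell with the boto3 SDK; the pipeline name is a placeholder, and a real pipeline would also need a full definition (data nodes, activities and a schedule) registered before it can be activated:

```python
import boto3

# Minimal sketch: create an empty pipeline shell in AWS Data Pipeline.
dp = boto3.client("datapipeline")

pipeline = dp.create_pipeline(
    name="daily-clickstream-copy",           # placeholder pipeline name
    uniqueId="daily-clickstream-copy-001",   # idempotency token
)
pipeline_id = pipeline["pipelineId"]
print(pipeline_id)

# After registering the pipeline objects (e.g. an S3 input node, an EMR
# activity and a daily schedule) with dp.put_pipeline_definition(...),
# the pipeline is started with:
# dp.activate_pipeline(pipelineId=pipeline_id)
```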
Check out our recent blogs on Big Data for more on its major challenges and opportunities.