Organizations generate large amounts of data and need to move and process it with a variety of tools and services. Migrating and processing large volumes of data on a recurring basis is tedious work that calls for a high level of automation and continuous monitoring.
For example, an organization may have multiple web servers deployed on premises and in the cloud, and the logs generated by these servers have to be processed periodically. This requires consolidating log files from multiple sources, placing them in a central location, and then processing them.
AWS Data Pipeline is a web service from Amazon Web Services that provides an easy, automated way to move data between multiple sources, both within and outside AWS, and to transform that data. Data Pipeline is a highly scalable, fully managed service.
AWS Data Pipeline lets users define a dependent chain of data sources and destinations, along with optional data processing activities; this chain is called a pipeline. The tasks within a pipeline can be scheduled to perform various data movement and processing activities. In addition to scheduling, you can include failure handling and retry options in your pipeline workflows.
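As a rough illustration of the retry options, the sketch below adds retry settings to a single activity object written in the JSON shape used by pipeline definition files. The field names `maximumRetries` and `retryDelay` follow the Data Pipeline object reference as we understand it and should be checked against the current documentation; the helper `with_retries`, the id `ProcessLogs`, and the command are hypothetical.

```python
import json


def with_retries(activity, max_retries=3, retry_delay="10 minutes"):
    """Return a copy of a pipeline object with retry settings added."""
    updated = dict(activity)
    # Field values in definition files are commonly written as strings.
    updated["maximumRetries"] = str(max_retries)
    updated["retryDelay"] = retry_delay
    return updated


# Hypothetical activity that processes consolidated web server logs.
process_logs = {
    "id": "ProcessLogs",
    "type": "ShellCommandActivity",
    "command": "echo processing logs",
}

retried = with_retries(process_logs, max_retries=2)
print(json.dumps(retried, indent=2))
```

The helper returns a copy instead of mutating its argument, so the same base activity can be reused with different retry settings.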
With AWS Data Pipeline, it is fast and easy to provision pipelines that move and transform data, which saves development effort and maintenance overhead.
When creating a pipeline, you need to define activities, data nodes, schedules, and preconditions for activities.
Activities are the actions that Data Pipeline executes. Activities currently supported by Data Pipeline include:
- Copy Activity – A Copy Activity copies data between S3 buckets, and between S3 and JDBC data sources.
- EMR Activity – An EMR Activity allows you to run Amazon EMR jobs.
- Hive Activity – A Hive Activity executes Hive queries.
- Shell Command Activity – A Shell Command Activity allows you to run shell scripts or commands.
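To make the activity types above more concrete, here is a minimal sketch of how two of them might be written as pipeline objects. It assumes the JSON definition format in which objects point at each other through `{"ref": ...}` entries, and the type names `CopyActivity` and `ShellCommandActivity` as used in the Data Pipeline object reference. The helper functions and every id (`CopyLogs`, `RawLogs`, `CentralLogs`, `WorkerInstance`, `CountLogLines`) are hypothetical.

```python
import json


def ref(object_id):
    """Build a reference to another pipeline object by its id."""
    return {"ref": object_id}


def copy_activity(activity_id, source_id, destination_id, resource_id):
    """Describe a copy from one data node to another."""
    return {
        "id": activity_id,
        "type": "CopyActivity",
        "input": ref(source_id),
        "output": ref(destination_id),
        "runsOn": ref(resource_id),
    }


def shell_command_activity(activity_id, command, resource_id):
    """Describe a shell command to run on a compute resource."""
    return {
        "id": activity_id,
        "type": "ShellCommandActivity",
        "command": command,
        "runsOn": ref(resource_id),
    }


activities = [
    copy_activity("CopyLogs", "RawLogs", "CentralLogs", "WorkerInstance"),
    shell_command_activity("CountLogLines", "wc -l *.log", "WorkerInstance"),
]
print(json.dumps(activities, indent=2))
```

The referenced data nodes and the compute resource would have to be defined as separate objects in the same pipeline definition.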
A data node is a representation of your data. Data Pipeline currently supports the following data sources:
- S3 Bucket
- MySQL DB
- SQL Data Source
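As an example of a data node, the sketch below describes an S3 location as a pipeline object. It assumes the `S3DataNode` type and its `directoryPath` field from the Data Pipeline object reference; the helper `s3_data_node`, the node ids, and the bucket `example-bucket` are hypothetical.

```python
def s3_data_node(node_id, directory_path):
    """Describe an S3 directory as a pipeline data node."""
    if not directory_path.startswith("s3://"):
        raise ValueError("expected an s3:// URI, got %r" % directory_path)
    return {
        "id": node_id,
        "type": "S3DataNode",
        "directoryPath": directory_path,
    }


# Hypothetical source and destination for the web server log example.
raw_logs = s3_data_node("RawLogs", "s3://example-bucket/raw-logs/")
central_logs = s3_data_node("CentralLogs", "s3://example-bucket/central-logs/")
print(raw_logs)
print(central_logs)
```

Activities then refer to these nodes by id as their input or output.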
Data Pipeline allows you to schedule the activities defined in your pipeline. You can define an individual schedule for each of your activities.
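A schedule is itself a pipeline object that activities point to. The following sketch assumes the `Schedule` type with `startDateTime` and `period` fields, and a `schedule` reference on the activity, as described in the Data Pipeline object reference. The helper names and the ids `DailySchedule` and `CopyLogs` are hypothetical.

```python
from datetime import datetime


def schedule(schedule_id, start, period):
    """Describe a recurring schedule starting at the given datetime."""
    return {
        "id": schedule_id,
        "type": "Schedule",
        # Timestamps are written as YYYY-MM-DDTHH:MM:SS.
        "startDateTime": start.strftime("%Y-%m-%dT%H:%M:%S"),
        "period": period,
    }


def on_schedule(pipeline_object, schedule_id):
    """Return a copy of a pipeline object bound to a schedule."""
    updated = dict(pipeline_object)
    updated["schedule"] = {"ref": schedule_id}
    return updated


daily = schedule("DailySchedule", datetime(2024, 1, 1), "1 day")
copy_logs = on_schedule({"id": "CopyLogs", "type": "CopyActivity"}, daily["id"])
print(daily)
print(copy_logs)
```

Because each activity carries its own `schedule` reference, different activities can point at different schedule objects.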
A precondition is a check that can optionally be associated with a data node or an activity. The precondition check must complete successfully before the activity is executed. Data Pipeline provides several predefined preconditions:
- DynamoDBDataExists – This precondition checks existence of data in a DynamoDB table
- DynamoDBTableExists – This precondition checks for the existence of a DynamoDB table
- RDSSqlPrecondition – This precondition runs a query against an RDS database and validates that the query output matches the expected results
- S3KeyExists – This precondition checks for the existence of a specific Amazon S3 path
- S3PrefixExists – This precondition checks for the existence of at least one file within a specific path
- ShellCommandPrecondition – This precondition executes a shell script to check if it completes successfully
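The sketch below shows one way a precondition could be attached to an activity, using `S3KeyExists` from the list above. It assumes the `s3Key` field on the precondition and a `precondition` reference on the activity, as we understand the Data Pipeline object reference. The helper names, the ids, and the S3 path are hypothetical.

```python
def s3_key_exists(precondition_id, s3_key):
    """Describe a check that a specific S3 key is present."""
    return {
        "id": precondition_id,
        "type": "S3KeyExists",
        "s3Key": s3_key,
    }


def require(pipeline_object, precondition_id):
    """Return a copy of a pipeline object gated on a precondition."""
    updated = dict(pipeline_object)
    updated["precondition"] = {"ref": precondition_id}
    return updated


# Hypothetical marker file written once all logs have been uploaded.
logs_ready = s3_key_exists("LogsReady", "s3://example-bucket/raw-logs/_SUCCESS")
copy_logs = require({"id": "CopyLogs", "type": "CopyActivity"}, logs_ready["id"])
print(logs_ready)
print(copy_logs)
```

With this arrangement, the copy would only be attempted once the marker file exists.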
Data Pipeline is a useful tool if you rely heavily on Amazon Web Services for storing and managing your data. Its automated workflows can save you considerable time when moving and transforming that data.
If you need help with data management on AWS or are looking for expert advice on Data Pipeline, contact us at email@example.com.