Before creating a cluster with EMR, it helps to know what Hadoop is. Hadoop is an open-source framework developed by the Apache Software Foundation. It provides distributed data storage and substantial processing power, handles a large number of concurrent tasks, and runs applications on clusters built from commodity hardware.
Processing and analysing massive amounts of data across clusters is usually Hadoop's job, but installing Hadoop, integrating it with other software, tuning its parameters and configuring its components can be difficult. This is where Amazon EMR comes into play.
Amazon EMR (Elastic MapReduce) is a cloud service built to manage, analyse and process large amounts of data (aka big data). EMR keeps the configuration required to manage big data to a minimum, which avoids the need for complicated in-house cluster computing.
EMR is based on Apache Hadoop, which supports processing large datasets in a distributed environment. MapReduce is a framework that runs code to process large, unstructured datasets in parallel across a distributed cluster of processors. MapReduce can also be run on a stand-alone machine.
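To make the MapReduce idea concrete, here is a minimal single-machine sketch of a word count in plain Python. It is not Hadoop's API; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are made up for illustration, and on a real cluster the framework would spread these phases across many nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse all values for one key into a single total."""
    return key, sum(values)


def word_count(documents: List[str]) -> Dict[str, int]:
    """Run the three phases in sequence on one machine."""
    mapped = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data on EMR", "big clusters process big data"]))
```

Because each call to `map_phase` depends only on its own record, and each call to `reduce_phase` only on its own key, both phases can be run in parallel, which is what makes the model suitable for a distributed cluster.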
The term ‘elastic’ refers to the service's ability to scale up and down, depending on requirements and usage at any point in time.
EMR uses a master-worker architecture, in which the master node is responsible for coordinating work: scheduling, assigning tasks to nodes, and checking progress. The worker nodes do the actual work and are responsible for processing and storing the data. EMR supports provisioning multiple master nodes, which provides high availability as well as data backup and recovery facilities.
The master and worker EC2 instances need to be in the ‘running’ state for the cluster to work.
The output folder should be empty, and the bucket location should be correct. Bucket and folder names should contain only letters, numbers, hyphens, and periods, and shouldn’t end with a number.
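The naming rule above can be checked before creating anything. The following is a minimal sketch that encodes only the rule as stated here; the helper name `is_valid_emr_name` is hypothetical, and Amazon S3 applies further bucket-naming rules of its own that this check does not cover.

```python
import re

# Only letters, numbers, hyphens, and periods are allowed (rule from the text above).
_ALLOWED_CHARS = re.compile(r"[A-Za-z0-9.-]+")


def is_valid_emr_name(name: str) -> bool:
    """Return True if a bucket or folder name satisfies the rule described above."""
    if not name or _ALLOWED_CHARS.fullmatch(name) is None:
        return False
    # The name must not end with a number.
    return not name[-1].isdigit()


if __name__ == "__main__":
    for candidate in ["my-emr-output", "logs2", "my_bucket"]:
        print(candidate, is_valid_emr_name(candidate))
```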
Once you have the bucket in place, select it from the list of buckets, click ‘Create folder’, replace ‘New folder’ with a name that fulfils the above requirements, and click ‘Save’.
An Amazon EC2 key pair must already exist, or one should be created. It is used to connect to the nodes of the cluster over a secure channel using the SSH (Secure Shell) protocol.
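As a rough illustration of where the bucket and the key pair end up, the sketch below assembles the keyword arguments that boto3's EMR client method `run_job_flow` accepts. It only builds a dictionary and does not contact AWS. The helper name `build_cluster_request`, the release label, the instance types and the default role names are example values assumed for illustration, not settings taken from this post.

```python
from typing import Any, Dict


def build_cluster_request(
    name: str,
    log_bucket: str,
    log_folder: str,
    key_pair_name: str,
    worker_count: int = 2,
) -> Dict[str, Any]:
    """Assemble keyword arguments for boto3's EMR run_job_flow call."""
    return {
        "Name": name,
        # Cluster logs are written to the S3 bucket and folder prepared earlier.
        "LogUri": f"s3://{log_bucket}/{log_folder}/",
        "ReleaseLabel": "emr-6.15.0",  # example value; pick a current release
        "Applications": [{"Name": "Hadoop"}],
        "Instances": {
            "MasterInstanceType": "m5.xlarge",  # example instance type
            "SlaveInstanceType": "m5.xlarge",   # example instance type
            "InstanceCount": 1 + worker_count,  # one master plus the workers
            "Ec2KeyName": key_pair_name,        # EC2 key pair used for SSH access
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default role name
        "ServiceRole": "EMR_DefaultRole",      # assumed default role name
    }


if __name__ == "__main__":
    request = build_cluster_request("demo-cluster", "my-emr-output", "logs", "my-key-pair")
    print(request)
    # With credentials configured, this could be submitted with:
    #   boto3.client("emr").run_job_flow(**request)
```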
Following are the steps to create an EMR Cluster:
Note: Amazon EMR also has the option of cloning a previously terminated cluster. The metadata of a terminated cluster can be found in the EMR console. This metadata is stored free of cost for 2 months after the cluster is terminated.
The cluster status page contains the ‘Summary’ of the cluster, which can be used to keep track of progress while the cluster is being created and to look at the details of its status. The items on this page are updated as cluster tasks complete. Click the refresh icon periodically to stay up to date.
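The manual refreshing described above amounts to polling. Below is a minimal sketch of such a loop, assuming a caller-supplied `get_state` function that returns the current cluster state as a string. The names `wait_for_cluster` and `SETTLED_STATES` are hypothetical. With boto3, `get_state` could be a small wrapper around `describe_cluster(ClusterId=...)["Cluster"]["Status"]["State"]`.

```python
import time
from typing import Callable

# States in which the cluster is ready for work or has stopped for good.
SETTLED_STATES = {"RUNNING", "WAITING", "TERMINATED", "TERMINATED_WITH_ERRORS"}


def wait_for_cluster(
    get_state: Callable[[], str],
    poll_seconds: float = 30.0,
    max_polls: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_state until the cluster settles or max_polls is reached."""
    state = get_state()
    polls = 1
    while state not in SETTLED_STATES and polls < max_polls:
        sleep(poll_seconds)  # wait before asking again, like clicking refresh
        state = get_state()
        polls += 1
    return state


if __name__ == "__main__":
    # Simulated sequence of states standing in for real status calls.
    fake_states = iter(["STARTING", "BOOTSTRAPPING", "WAITING"])
    print(wait_for_cluster(lambda: next(fake_states), poll_seconds=0))
```

Passing `sleep` as a parameter keeps the loop easy to try out without real delays.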
In this post, we covered what Apache Hadoop is, what Elastic MapReduce is, and how Amazon EMR can be used to create a Hadoop cluster.