AWS Tutorials

How to Create an EMR Cluster?

Before getting into creating a cluster with EMR, it is essential to know what Hadoop is. Hadoop is an open-source framework developed by the Apache Software Foundation. It provides efficient data storage, substantial data processing power, and the ability to handle many concurrent tasks by running applications on clusters of commodity hardware.

Introduction

Processing big data, analysing massive amounts of data, and running multiple clusters at the same time are typically Hadoop's responsibility, but installing Hadoop, integrating it with other software, tuning its parameters, and configuring its components can be difficult. This is where Amazon EMR comes into play.

Amazon EMR (Elastic MapReduce) is a cloud service built to manage, analyse, and process large amounts of data (big data). EMR keeps the configuration required to manage big data to a minimum, avoiding the complexity of an in-house cluster computing setup.

EMR is based on Apache Hadoop, which supports processing large datasets in a distributed environment. MapReduce is a framework for running code that processes large, unstructured data in parallel over a distributed cluster of processors. MapReduce can also run on a single stand-alone machine.
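To make the MapReduce model concrete, here is a minimal single-machine sketch of the classic word-count job in Python. The input lines and variable names are made up for illustration; on a real Hadoop or EMR cluster, the map, shuffle, and reduce phases below would run distributed across many nodes.

```python
from functools import reduce
from itertools import groupby

# Toy input; on a cluster this would be split across many nodes.
lines = ["big data on emr", "emr runs hadoop", "hadoop processes big data"]

# Map phase: emit a (word, 1) pair for every word in every line.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: bring all pairs with the same key (word) together.
mapped.sort(key=lambda kv: kv[0])
grouped = {key: [v for _, v in pairs]
           for key, pairs in groupby(mapped, key=lambda kv: kv[0])}

# Reduce phase: sum the counts for each word.
counts = {word: reduce(lambda a, b: a + b, ones)
          for word, ones in grouped.items()}

print(counts["big"], counts["emr"], counts["hadoop"])  # 2 2 2
```

The same three phases are what Hadoop parallelises: mappers run on chunks of the input, the framework shuffles intermediate pairs by key, and reducers aggregate each key's values.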

The term ‘elastic’ refers to the service’s ability to scale up and down depending on requirements and usage at any point in time.

Hadoop Architecture

Hadoop uses a master-worker architecture, wherein the master is responsible for coordination: scheduling, assigning tasks to nodes, and tracking progress. Worker nodes do the actual work of processing and storing the data. EMR supports provisioning multiple master nodes, thereby providing high availability as well as data backup and recovery.

The master and worker EC2 instances need to be in the ‘running’ state for the cluster to work.

Prerequisites

  • Creating an Amazon S3 bucket: 

The output folder should be empty, and make sure the bucket's location (region) is correct. Bucket and folder names may contain letters, numbers, hyphens, and periods, and shouldn't end with a number.

Once you have the bucket in place, select it from the list of buckets, click ‘Create folder’, replace ‘New folder’ with a name that fulfils the above requirements, and click ‘Save’.

  • Creating an Amazon EC2 key pair: 

An Amazon EC2 key pair needs to already be present, or one should be created. It is used to connect to the cluster's nodes over a secure channel using the SSH (Secure Shell) protocol.
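Both prerequisites can also be sketched with the AWS CLI. The bucket name, folder name, key pair name, and region below are hypothetical placeholders; substitute your own.

```shell
# Create the S3 bucket (name and region are placeholders).
aws s3 mb s3://my-emr-tutorial-bucket --region us-east-1

# S3 has no real folders; an empty object whose key ends in "/"
# appears as a folder in the console.
aws s3api put-object --bucket my-emr-tutorial-bucket --key output/

# Create an EC2 key pair and save the private key for SSH access.
aws ec2 create-key-pair \
    --key-name MyEMRKeyPair \
    --query 'KeyMaterial' \
    --output text > MyEMRKeyPair.pem

# Restrict permissions so SSH will accept the private key file.
chmod 400 MyEMRKeyPair.pem
```

Note that these commands require configured AWS credentials with permissions for S3 and EC2.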

Creating an EMR cluster

Following are the steps to create an EMR cluster:

  • Sign in to the AWS Management Console and open the Amazon EMR console.
  • Click on the ‘Create cluster’ option.


  • From here, there are two ways to specify details:
  1. Create cluster - quick options: accept the default values for all fields except ‘Cluster name’, and choose the EC2 key pair in the ‘Security and access’ tab.
  2. From the quick options page, click ‘Go to advanced options’ to specify more details about the cluster.
  • In the ‘Advanced Options’ tab, different software can be selected and installed on the EMR cluster, depending on the data requirements. For example, Hive can be installed for an SQL interface, whereas ZooKeeper can be used to coordinate between distributed applications.


  • Click on the ‘Next’ button to select hardware for the EMR cluster.
    Note: In the previous step, big data processing jobs can also be added.
  • Two kinds of worker nodes can be used in EMR clusters: Core and Task.
  • A ‘Core’ node both stores and processes data, whereas a ‘Task’ node only processes data. When ‘Task’ nodes are used, they are charged for based on usage.
    Hence, if the requirement is simple, it is strongly suggested to use ‘Core’ nodes only.
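When creating the cluster from the CLI rather than the console, these node types map to instance groups. The following is only a fragment of a hypothetical `aws emr create-cluster` call (the `...` stands for the other required options), with placeholder instance types and counts:

```shell
aws emr create-cluster ... \
    --instance-groups \
        InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m5.xlarge \
        InstanceGroupType=CORE,InstanceCount=2,InstanceType=m5.xlarge \
        InstanceGroupType=TASK,InstanceCount=2,InstanceType=m5.xlarge
```

Dropping the `TASK` line gives a core-only cluster, matching the suggestion above for simple requirements.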


  • Click on the ‘Next’ button and specify the ‘Cluster name’ here.
  • Click ‘Next’ again. Notice that the ‘Termination Protection’ option is turned on by default, which ensures the EMR cluster doesn’t get deleted accidentally; it adds a few extra steps before the cluster can be terminated.
  • Here, the different security options that safeguard the EMR cluster are specified. Select the EC2 key pair that will be used to log in to the EC2 instances.
  • EMR creates the required roles and security groups and attaches them to the master and worker EC2 nodes.
  • Click on the ‘Create Cluster’ option.
  • Creating the cluster takes a few minutes, since the EC2 instances need to come up and the software selected previously needs to be installed and configured.
  • The cluster goes from the ‘Provisioning’ state to ‘Bootstrapping’ to ‘Waiting’. The ‘Waiting’ state means EMR is waiting for the user to submit a big data processing job (such as MapReduce, Hive, or Spark).
  • This can be done by going to the ‘Steps’ tab and clicking on ‘Add Step’.

  • Here, a type of step can be selected from the given options (MapReduce, Hive, Spark).
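The console walkthrough above also has a CLI equivalent. The following is a minimal sketch of creating a comparable cluster in one command; the cluster name, key pair, bucket, and release label are hypothetical placeholders, so check the EMR documentation for current release labels before using it:

```shell
aws emr create-cluster \
    --name "my-emr-cluster" \
    --release-label emr-6.15.0 \
    --applications Name=Hadoop Name=Hive Name=Spark \
    --ec2-attributes KeyName=MyEMRKeyPair \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --use-default-roles \
    --log-uri s3://my-emr-tutorial-bucket/logs/

# The command prints the new cluster's ID (of the form j-...),
# which later commands and step submissions refer to.
```

Here `--use-default-roles` corresponds to EMR creating and attaching the required roles and security groups, as described above.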

Note: Amazon EMR also offers the option of cloning a previously terminated cluster. Go to the EMR console, where the metadata of terminated clusters can be found. This metadata is retained free of charge for two months after a cluster is terminated.

The cluster status page contains a ‘Summary’ of the cluster, which can be used to track progress while the cluster is being created and to look at the details of the cluster's status. The items on this status page are updated as each cluster task completes; click the refresh icon to see the latest status.
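Instead of repeatedly refreshing the status page, the cluster state can also be polled from the CLI. The cluster ID below is a placeholder for the `j-...` ID returned when the cluster was created:

```shell
# Prints the current state, e.g. STARTING, BOOTSTRAPPING, or WAITING.
aws emr describe-cluster \
    --cluster-id j-XXXXXXXXXXXXX \
    --query 'Cluster.Status.State' \
    --output text
```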

Conclusion 

In this post, we understood what Apache Hadoop is, what Elastic MapReduce is, and how Amazon EMR can be used to create a Hadoop cluster.
