AWS Tutorials

By KnowledgeHut

Before creating a cluster with EMR, it helps to know what Hadoop is. Hadoop is an open-source framework developed by the Apache Software Foundation. It provides distributed data storage, large-scale data processing power, the ability to handle many concurrent tasks, and a way to run applications on clusters of commodity hardware.

How to Create an EMR Cluster?

Introduction

Processing big data, analysing massive amounts of data and running multiple clusters at the same time is usually Hadoop's job. However, installing Hadoop, integrating it with other software, tuning its parameters and configuring its components can be difficult. This is where Amazon EMR comes into play.

Amazon EMR (Elastic MapReduce) is a cloud service built to manage, analyse and process large amounts of data (big data). EMR keeps the configuration needed to work with big data to a minimum, which avoids the need to run a complicated in-house compute cluster.

EMR is based on Apache Hadoop, which supports processing large datasets in a distributed environment. MapReduce is a programming framework that runs code over large, unstructured datasets in parallel across a distributed cluster of processors. MapReduce can also run on a stand-alone machine.
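
To make the MapReduce idea concrete, here is a minimal single-machine sketch of a word count in plain Python. It is not how Hadoop or EMR implement MapReduce; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative assumptions, and on a real cluster the framework would run the map and reduce functions on many nodes in parallel.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence on one machine."""
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["big data on EMR", "big clusters"]))
```

Each phase only sees its own input, which is what allows a framework to spread the map calls and the reduce calls over separate worker nodes.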

The term ‘elastic’ refers to the service’s ability to scale up and down, depending on the requirement and usage at any point in time.

Hadoop Architecture

Hadoop uses a master-worker architecture, in which the master node coordinates the work: scheduling, assigning tasks to nodes and checking progress. The worker nodes process and store the data. EMR supports provisioning multiple master nodes, thereby providing high availability as well as data backup and recovery facilities.

The master and worker EC2 instances need to be in the ‘running’ state for the cluster to work.

Prerequisites

Creating an Amazon S3 bucket:

Make sure that the bucket location is correct and that the output folder is empty. Bucket and folder names may contain only letters, numbers, hyphens and periods, and must not end with a number.

Once the bucket is in place, select it from the list of buckets, click ‘Create folder’, replace ‘New folder’ with a name that fulfils the above requirements, and click ‘Save’.
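
The naming rules above can be checked before creating anything. The following is a minimal sketch that tests only the two rules stated in this section (allowed characters, and no trailing number). The function name `is_valid_emr_name` is hypothetical, and Amazon S3 applies further naming rules of its own that are not covered here.

```python
import re

# Rule from the text above: letters, numbers, hyphens and periods only.
_ALLOWED_CHARS = re.compile(r"[A-Za-z0-9.-]+")


def is_valid_emr_name(name: str) -> bool:
    """Return True if a bucket or folder name satisfies the two rules above."""
    if not _ALLOWED_CHARS.fullmatch(name):
        return False  # empty, or contains a character outside the allowed set
    return not name[-1].isdigit()  # the name must not end with a number


if __name__ == "__main__":
    for candidate in ["my-emr-logs", "output2", "my_bucket"]:
        print(candidate, is_valid_emr_name(candidate))
```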

Creating an Amazon EC2 key pair:

An Amazon EC2 key pair must already exist, or one should be created. It is used to connect to the cluster nodes over a secure channel using the SSH (Secure Shell) protocol.
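
As an illustration of how the key pair is used later, the sketch below assembles the SSH command for logging in to the master node. The helper name `build_ssh_command`, the key file name and the host name are placeholders, not values from this tutorial. The `hadoop` login user is the usual default on EMR nodes, but treat it as an assumption for your own cluster.

```python
import shlex
from typing import List


def build_ssh_command(key_file: str, master_dns: str, user: str = "hadoop") -> List[str]:
    """Build the argument list for an SSH login to the master node."""
    # -i selects the private key file of the EC2 key pair.
    return ["ssh", "-i", key_file, f"{user}@{master_dns}"]


if __name__ == "__main__":
    # Placeholder values; replace with your own key file and master public DNS.
    command = build_ssh_command("my-key-pair.pem", "ec2-198-51-100-1.compute-1.amazonaws.com")
    print(shlex.join(command))
```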

Creating an EMR cluster

The steps to create an EMR cluster are as follows:

Sign in to the AWS Management Console and open the Amazon EMR console.
Click on the ‘Create cluster’ option.

From here, there are two ways to specify the cluster details.

Create cluster - quick options: Here the user accepts the default values for all fields except the ‘Cluster name’ and the EC2 key pair, which is chosen in the ‘Security and access’ section.
Advanced options: From the quick options page, click ‘Go to advanced options’ to specify more details about the cluster.

In the ‘Advanced Options’ tab, the software to install on the EMR cluster can be selected according to the data requirements. For example, Hive can be selected to provide an SQL interface, whereas ZooKeeper can be selected to coordinate distributed applications.

Click on the ‘Next’ button to select the hardware for the EMR cluster.
Note: In the previous step, big data processing jobs can optionally be added.
Besides the master node, two kinds of nodes can be used in EMR clusters: Core and Task.
A ‘Core’ node both stores and processes data, whereas a ‘Task’ node only processes data. When ‘Task’ nodes are also used, they are charged for based on their usage.
Hence, if the requirement is simple, it is strongly suggested to use ‘Core’ nodes only.
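
The same hardware choice can be written down as data. The sketch below builds a list in the shape that the `InstanceGroups` field of the EMR API accepts (for example through boto3's `run_job_flow`), and adds a Task group only when one is requested. The helper name `build_instance_groups` and the `m5.xlarge` instance type are illustrative assumptions; nothing here contacts AWS.

```python
from typing import Any, Dict, List


def build_instance_groups(
    core_count: int,
    task_count: int = 0,
    instance_type: str = "m5.xlarge",  # assumed instance type, adjust as needed
) -> List[Dict[str, Any]]:
    """Describe master, core and optional task instance groups."""
    if core_count < 1:
        raise ValueError("at least one core node is needed to store data")

    def group(name: str, role: str, count: int) -> Dict[str, Any]:
        return {
            "Name": name,
            "InstanceRole": role,  # MASTER, CORE or TASK
            "InstanceType": instance_type,
            "InstanceCount": count,
        }

    groups = [group("Master", "MASTER", 1), group("Core", "CORE", core_count)]
    if task_count > 0:
        # Task nodes only process data; they are billed while they run.
        groups.append(group("Task", "TASK", task_count))
    return groups


if __name__ == "__main__":
    for g in build_instance_groups(core_count=2):
        print(g["InstanceRole"], g["InstanceCount"])
```

Leaving `task_count` at zero corresponds to the ‘Core’-only setup suggested above for simple requirements.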

Click on the ‘Next’ button and specify the ‘Cluster name’.
Click on the ‘Next’ button again. Note that the ‘Termination Protection’ option is turned on by default, which makes sure that the EMR cluster doesn’t get terminated accidentally. With this option on, a few extra steps are needed before the cluster can be terminated.
On this page, the security options that safeguard the EMR cluster are specified. Select the EC2 key pair that will be used to log in to the EC2 instances.
EMR creates the required roles and security groups and attaches them to the master and worker EC2 nodes.
Click on the ‘Create Cluster’ option.
Creating the cluster takes a few minutes, since the EC2 instances need to come up and the software selected earlier needs to be installed and configured.
The cluster moves from the ‘Provisioning’ stage to ‘Bootstrapping’ and then to the ‘Waiting’ state. In the ‘Waiting’ state, EMR is waiting for the user to submit a big data processing job (such as MapReduce, Hive or Spark).
A job can be submitted by going to the ‘Steps’ tab and clicking on ‘Add Step’.

Here, the type of step can be selected from the given options (MapReduce, Hive, Spark).
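
For readers who prefer scripting to the console, the choices made in the steps above map roughly onto one API request. The sketch below only builds the request dictionaries; it does not call AWS. The field names follow the EMR `RunJobFlow` API as exposed by boto3, while the helper names, release label, instance type, bucket paths and default application list are assumptions chosen for illustration.

```python
from typing import Any, Dict, Sequence


def build_cluster_request(
    name: str,
    key_pair: str,
    log_uri: str,
    applications: Sequence[str] = ("Hadoop", "Hive", "Spark"),
    release_label: str = "emr-6.15.0",  # assumed release, pick a current one
) -> Dict[str, Any]:
    """Build keyword arguments for an EMR run_job_flow call."""
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        "LogUri": log_uri,
        "Applications": [{"Name": app} for app in applications],
        "Instances": {
            "InstanceGroups": [
                {"Name": "Master", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            "Ec2KeyName": key_pair,
            "KeepJobFlowAliveWhenNoSteps": True,  # stay up and wait for steps
            "TerminationProtected": True,  # mirrors the console option
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # default EC2 instance profile
        "ServiceRole": "EMR_DefaultRole",  # default EMR service role
    }


def build_spark_step(name: str, script_uri: str) -> Dict[str, Any]:
    """Build one step that submits a Spark script stored in S3."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_uri],
        },
    }


if __name__ == "__main__":
    request = build_cluster_request("demo-cluster", "my-key-pair", "s3://my-emr-logs/logs/")
    step = build_spark_step("demo-step", "s3://my-emr-logs/scripts/job.py")
    print(request["Name"], step["HadoopJarStep"]["Args"])
```

With boto3 installed and credentials configured, the request could be passed as `boto3.client("emr").run_job_flow(**request)`, and the step added afterwards with `add_job_flow_steps(JobFlowId=..., Steps=[step])`.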

Note: Amazon EMR also has the option of cloning a previously terminated cluster. The metadata of a terminated cluster can be found in the EMR console. EMR stores this metadata free of cost for two months after the cluster is terminated.

The cluster status page contains the ‘Summary’ of the cluster, which can be used to track progress while the cluster is being created and to look at the details of the cluster’s status. As each cluster task completes, the items on this status page are updated. Click the refresh icon regularly to stay up to date.
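
Instead of refreshing the page by hand, the status can be polled from a script. The sketch below is one possible approach: it takes any function that returns the current state string, so it can be tried without AWS access. The state names used here (`WAITING`, `TERMINATED`, `TERMINATED_WITH_ERRORS`) are the ones the EMR API reports, which may differ slightly from the labels shown in the console; the helper name `wait_until_waiting` and its defaults are assumptions.

```python
import time
from typing import Callable

# States after which the cluster will never reach WAITING.
TERMINAL_STATES = {"TERMINATED", "TERMINATED_WITH_ERRORS"}


def wait_until_waiting(
    get_state: Callable[[], str],
    poll_seconds: float = 30.0,
    max_polls: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_state until the cluster is WAITING or has terminated."""
    for _ in range(max_polls):
        state = get_state()
        if state == "WAITING" or state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)  # pause between polls
    raise TimeoutError("cluster did not reach WAITING in time")


if __name__ == "__main__":
    # Simulated state sequence, standing in for real API responses.
    fake_states = iter(["STARTING", "BOOTSTRAPPING", "WAITING"])
    print(wait_until_waiting(lambda: next(fake_states), poll_seconds=0))
```

With boto3, `get_state` could be a function that returns `emr.describe_cluster(ClusterId=cluster_id)["Cluster"]["Status"]["State"]`, where `emr` is a boto3 EMR client and `cluster_id` is the ID of the cluster being created.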

Conclusion

In this post, we covered what Apache Hadoop is, what Elastic MapReduce is, and how Amazon EMR can be used to create a Hadoop cluster.
