Amazon EMR(Elastic MapReduce) is a cloud-based big data platform that allows the team to quickly process large amounts of data at an effective cost. For this, they use open source tools like Apache Hive, Apache Spark, Apache Flink, Apache HBase, and Presto. With the help of Amazon S3’s scalable storage and Amazon EC2’s dynamic stability, EMR provides the elasticity and engines for running Petabyte-scale analysis. The cost of this is just a fraction of the traditional on-premise clusters’ cost. For iterative collaboration, development, and data access across data products like Amazon DynamoDB, Amazon S3, and Amazon Redshift, you can use Jupyter-based EMR Notebooks. It helps in reducing time for insight and operationalizing analytics quickly.
Several customers use EMR for reliably and securely handling the big data use cases like machine learning, deep learning, bioinformatics, financial and scientific stimulation, log analysis, and data transformations (ETL). With EMR, the team has the flexibility of running use cases on short lived, single-purpose clusters or highly available, long running clusters.
Since clusters are launched in minutes by EMR, you don’t have to worry about infrastructure setup, node provisioning, cluster tuning, and Hadoop configuration. All these tasks are taken care of by EMR so that you can concentrate on analysis. Data engineers, data scientists, and data analysts can use the EMR notebooks for launching a serverless Jupyter notebook within a matter of seconds. This also allows the team and individuals to interactively explore, visualize and process the data.
The pricing of EMR is simple as well as predictable. There is a one-minute minimum charge and the rest is paid according to per-instance rate for every second. You can use applications like Apache Hive and Apache Spark for launching a 10-node EMR cluster for a low cost of $0.15 per hour. Also, EMR has native support for Reserved instances and Amazon EC2 spot, which can help you save on the cost of the underlying instances by about 50-80%. The pricing of Amazon EMR depends on the number of deployed EC2 instances, the type of the instance and the region where you are launching your cluster. Since it is on-demand pricing, you can expect low rates. But for reducing the cost even further, you can purchase Spot instances or reserved instances. The cost of spot instanced is about one-tenth less than the on-demand pricing. Remember that if you are using services like Amazon D3, DynamoDB or Amazon Kinesis along with your EMR cluster, they will be charged separately from the usage for Amazon EMR.
EMR allows provisioning of not one but thousands of compute instances for processing data at any scale. All these instances’ numbers can be decreased or increased manually or automatically with the help of Auto Scaling which can manage the size of clusters on the basis of utilization. This allows you to pay only for what you use. Also, unlike the on-premise clusters of the rigid infrastructure, EMR decouples persistent storage and compute which gives you the ability of scaling every one of them independently.
Thanks to EMR, you can now spend less time monitoring and tuning your cluster. Tuned for the cloud, EMT monitors your cluster constantly. They retry failed tasks and replace poorly performed instances automatically. Also, you don’t have to manage bug fixes and updates as EMR provides the latest stable releases of open source software. This results in lesser efforts and fewer issues in maintaining the environment. With the help of multiple master nodes, clusters are not only highly available, but also failover in case of a node failure automatically.
With Amazon EMR, you have a configuration option for controlling the termination of your cluster, whether you do it manually or automatically. If you go for the option of automatic termination, the cluster will be terminated once the steps are completed. This is known as a transient cluster. However, if you go for the manual option, the cluster will continue to run even after the processing is completed. You will have to manually terminate it when you no longer need it. The other option is creating a cluster, interacting directly with the installed applications, and then manually terminating the cluster. These are known as long-running clusters.
Also, there is an option of configuring the termination protection for preventing the clusters’ instances from being terminated due to issues and errors during processing. This allows recovery of instances’ data before they are terminated. These options’ default settings depend on whether you launched your cluster with the console, API or CLI.
EMR is responsible for automatically configuring the firewall settings of EC2. these setting control and instances’ network access and launches the clusters in an Amazon VPC. For all the objects residing in S3, client-side or server-side encryption is used along with EMRFS, which is an object store on S3 for Hadoop. To achieve this, you can either use your own customer-managed keys or the AWS Key Management Service. With the help of EMR, you can easily enable other encryption options like at-rest and in-transit encryption.
Amazon EMR can leverage AWS services like Amazon VPC and IAM and features like Amazon EC2 key pairs for securing the cluster and data. Let’s go through these leverages one by one:
When integrated with IAM, Amazon EMR allows managing of permissions. You can use the IAM policies for defining the permissions which are then attached to the IAM groups or IAM users. The defined permissions determine the actions the members of the group or the users can perform and accessible resources.
Apart from this, IAM roles are used by the Amazon EMR for Amazon EMR service itself as well as EC2 instance profile. These roles can grant permissions for accessing other AWS services. There is a default role for EC2 instance profile and Amazon EMR service. The AWS managed policies are used by the default role which is automatically created when you launch the EMR cluster from the console for the first time and select default permissions. You can use the AWS CLI for creating default IAM roles. For managing permissions, you can select custom roles for instance and the service profile.
Security groups are used by Amazon EMR for controlling outbound and inbound traffic to the EC2 instances. When you are launching the cluster, a security group for master instance and to be shared by the task/core instance is used. The security group rules are configured by the Amazon EMR for ensuring communication between the instances. Apart from this, there is an option for configuring additional security groups and assigning them to the master as well as task/core instances for advanced rules.
The Amazon S3 client-side and server-side encryption along with EMRFS is supported by the Amazon EMR, this allows protecting the data stored in Amazon S3. The server-side encryption allows encrypting the data after you have uploaded it. The client-side encryption allows encrypting the decrypting on the EMR cluster in the EMRFS client. You can use the AWS Key Management Service for managing the master key for the client-side encryption.
You can launch clusters in a Virtual Private Cloud (VPC). A VPC is a virtual network isolated in the AWS providing the ability to control network access and configuration’s advanced aspects.
When integrated with CloudTrail, Amazon EMR allows logging information regarding request made by the AWS account. You can use this information to track who is accessing the cluster and when and can even determine the IP address that made the request.
A secure connection needs to be formed between the master node and your remote computer for monitoring and interacting with the cluster. For the connection, you can use the Secure Shell (SSH) network and for authentication, you can use Kerberos. An Amazon EC2 key pair will be required, if you are using SSH.
EMR allows you to have complete control over the cluster. This involves easy installation of additional applications, having root access to every instance, and customizing every cluster with bootstrap actions. Also, you won’t have to re-launch the cluster for reconfiguring the running clusters on the fly or using the custom Amazon Linux AMIs for launching EMR clusters.
Also, you have the option of scaling up or down your clusters according to your computing needs. You can remove instances for controlling costs when peak workloads subside or add instances for peak overloads by resizing your clusters.
Amazon EMR also allows running multiple instance groups so that on-demand instanced can be used in a single group for processing power with spot instances in other group. This helps faster completion of jobs at a lower price. You can even take advantage of low price on one spot instance type over another by mixing different types of instances together.
Amazon EMR offers the flexibility of using different file systems for your input, intermediate and output data. For example:
Integrating Amazon EMR with other services offered by the AWS can help in providing functionalities and capabilities of networking, security, storage and many more. Here are some of the examples of such integration:
The EMR clusters have EC2 instances which are responsible for performing the work that you are submitting to the cluster. When you are launching the cluster, the instances with the applications like Apache Spark or Apache Hadoop are configured by the Amazon EMR. You need to select the type and size of the instance that suits the cluster’s processing needs including streaming data, batch processing, large data storage, and low-latency queries. There are different ways of configuring the software on your cluster provided by the Amazon EMR. For example:
Troubleshooting of Cluster issues like errors or failures can be done by using the log files and Amazon EMR Management Interface. You will have the capability of archiving log files in Amazon S3 for storing log and troubleshoot issues even after the cluster has been terminated. There is also an optional debugging tool available in the Amazon EMR console that can be used for browsing log files based on tasks, jobs and steps.
CloudWatch is integrated with Amazon EMR for tracking performance metrics for the cluster as well as the jobs within the cluster. Configuration of alarms is done based on metrics like what is the percentage of used storage or if the cluster is idle or not.
There are different ways for interacting with the Amazon EMR including the following:
This is a graphical user interface that can be used for launching and managing clusters. You need to specify the details of the cluster to be launched and check out the details of the existing clusters, terminated clusters and debug by filling out web forms. It is the easiest way to start working with Amazon EMR as no programming knowledge is required. You can get the console online from here.
This is a client application that you can run on your local machine for connecting to the Amazon EMR and creating and managing clusters. There is a set of commands available in the AWS CLI for the Amazon EMR, you can use this for writing scripts that can automate the launch and management of the cluster.
There are functions available in the SDKs that can call Amazon EMR for creating and managing clusters. You can even write applications for automating this process. It is the best way of extending and customizing the Amazon EMR’s functionality. The available SDKs for the Amazon EMR are Java, Go, PHP, Python, .NET, Ruby, and Node.js.
This is a low-level interface that uses JSON for calling the Amazon EMR directly. This can be used for creating a customized SDK that calls the web service. Now that we have discussed the benefits of EMR, let’s move on to the EMR use cases:
EMR provides built-in machine learning tools for scalable machine learning algorithms like TensorFLow, Apache Spark MLib, and Apache MXNet. Also you can easily use Bootstrap Actions and Custom AMIs for easily adding the preferred tools and libraries for creating your very own predictive analytics toolset.
For cost-effective and quick performance of data transformation workloads (ETL) like sort, join and aggregate on large datasets, you can use EMR.
With EMR, along with Apache Hive and Apache Spark, you can segment users, deliver effective ads by understanding the user preferences. All this can be achieved by analyzing the clickstream data from Amazon S3.
With EMR and Amazon Spark Streaming, analyzing events from Amazon Kinesis, Amazon Kafka or any other streaming data source is possible. This helps in creating highly available, long running, and fault-tolerant streaming data pipelines. Persist transformed insights to Amazon Elasticsearch and datasets to HDFS or Amazon S3.
With EMR Notebooks, you will be provided with an open-source Jupyter based, managed analytic environment. This will allow data analysts, developers and scientists in preparing and visualizing data, collaborating with peers, building applications, and performing interactive analysis.
EMR can also be used for quickly and efficiently processing large amounts of genomic data or any other large, scientific dataset. Genomic data hosted on AWS can be accessed by researchers for free.
In this article, you got a quick introduction to Amazon EMR and how it has different log files’ types. Also, you got to understand the benefits of Elastic MapReduce. To become an expert in AWS services, enroll in the AWS certification course offered by KnowledgeHut.
Your email address will not be published. Required fields are marked *
A solution architect is an AWS solutions Architect... Read More
AWS has dominated the field of cloud computing in ... Read More