
In this section we will look at RDDs in detail, as they are one of the most important features of Spark. A clear understanding of this topic is essential for building a good Spark application. We will also see how RDDs overcome the drawbacks of MapReduce processing.
Resilient Distributed Datasets
As we have already seen, RDDs are immutable, partitioned, distributed datasets that Spark uses for data processing. They are also fault tolerant: they can be recreated at any stage of processing if a network or cluster node failure occurs. An RDD can be created by parallelizing an existing collection in the driver program, by reading a dataset from external storage such as HDFS, HBase, Cassandra, or a database, or by transforming an existing RDD. Spark uses RDDs to achieve the same or similar processing results as MapReduce but at a much higher speed, commonly cited as 10 to 100 times faster. Let us see how.
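The properties described above (immutability, partitioning, and recovery by recomputation) can be illustrated with a small toy model. This is only a sketch in plain Python, not the Spark API: the class `ToyRDD` and its methods `parallelize`, `compute`, and `lose_partition` are hypothetical names chosen for illustration, and a real RDD tracks its lineage in a far more sophisticated way.

```python
# Toy model of an RDD. NOT the Spark API: all names here are hypothetical.
class ToyRDD:
    """An immutable, partitioned dataset that remembers how it was derived."""

    def __init__(self, partitions=None, parent=None, fn=None):
        # A source dataset holds data; a derived one holds a parent and a function.
        self._source = None if partitions is None else tuple(tuple(p) for p in partitions)
        self._parent = parent
        self._fn = fn
        self._materialized = {}  # partition index -> computed tuple

    @classmethod
    def parallelize(cls, data, num_partitions):
        """Split a local collection into partitions (round-robin)."""
        data = list(data)
        return cls([data[i::num_partitions] for i in range(num_partitions)])

    @property
    def num_partitions(self):
        if self._source is not None:
            return len(self._source)
        return self._parent.num_partitions

    def map(self, f):
        # A transformation returns a new ToyRDD; the original is never modified.
        return ToyRDD(parent=self, fn=lambda part: tuple(f(x) for x in part))

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda part: tuple(x for x in part if pred(x)))

    def compute(self, index):
        """Return one partition, rebuilding it from the lineage if it is missing."""
        if index not in self._materialized:
            if self._source is not None:
                self._materialized[index] = self._source[index]
            else:
                self._materialized[index] = self._fn(self._parent.compute(index))
        return self._materialized[index]

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.compute(i)]

    def lose_partition(self, index):
        """Simulate a node failure that wipes one computed partition."""
        self._materialized.pop(index, None)


numbers = ToyRDD.parallelize(range(10), num_partitions=3)
squares_of_evens = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

before = sorted(squares_of_evens.collect())
squares_of_evens.lose_partition(1)          # pretend a node went down
after = sorted(squares_of_evens.collect())  # partition 1 is rebuilt from its parent
```

After the simulated failure, only the lost partition is recomputed from its parent, and the result is the same as before. The source collection is left untouched by the transformations.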
Data Sharing is Slow in MapReduce
MapReduce has established itself as one of the leading technologies for processing and generating huge datasets in parallel using distributed algorithms in distributed environments. It lets users and developers run parallel computations through high-level APIs without having to deal with the intricacies of work distribution, concurrency, and fault tolerance.
When data must be reused between computation stages in MapReduce, the only way to do so is to write the output of the first stage to stable storage, e.g. HDFS. The MapReduce framework gives users many abstractions for using a cluster's compute power, but developers still want more out of it. Both iterative and interactive applications need very fast data sharing across parallel jobs. Data sharing is slow in MapReduce because of the disk IO, serialization, deserialization, and replication involved in writing intermediate results to stable storage. Hadoop applications have been reported to spend almost 90% of their time reading from and writing to storage systems.
Iterative Operations on MapReduce
Iterative operations reuse the intermediate results of one or more steps across multiple later stages of the application. The diagram below shows how MapReduce works in iterative applications and how the overhead of replication, disk IO, serialization, and deserialization affects the performance of the whole application.
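The disk round trip between iterations can be mimicked on a single machine. The sketch below is a toy in plain Python, not Hadoop code: the function `run_iterations_via_disk` is a hypothetical name, `pickle` stands in for the framework's serialization, a temporary directory stands in for HDFS, and replication is not modeled at all.

```python
import os
import pickle
import tempfile


def run_iterations_via_disk(data, step, iterations, workdir):
    """Apply `step` repeatedly, writing every intermediate result to disk."""
    stats = {"writes": 0, "reads": 0, "bytes_written": 0}

    # The input of the first stage starts out on "stable storage".
    path = os.path.join(workdir, "stage_0.pkl")
    with open(path, "wb") as f:
        pickle.dump(data, f)
    stats["writes"] += 1
    stats["bytes_written"] += os.path.getsize(path)

    for i in range(1, iterations + 1):
        # Each stage deserializes the previous stage's output from disk ...
        with open(path, "rb") as f:
            current = pickle.load(f)
        stats["reads"] += 1

        result = [step(x) for x in current]

        # ... and serializes its own output back to disk for the next stage.
        path = os.path.join(workdir, f"stage_{i}.pkl")
        with open(path, "wb") as f:
            pickle.dump(result, f)
        stats["writes"] += 1
        stats["bytes_written"] += os.path.getsize(path)

    # The caller reads the final result from storage as well.
    with open(path, "rb") as f:
        final = pickle.load(f)
    stats["reads"] += 1
    return final, stats


with tempfile.TemporaryDirectory() as workdir:
    final, stats = run_iterations_via_disk([1, 2, 3], lambda x: x * 2, 3, workdir)
```

Every iteration costs one read and one write on top of the actual computation, so the IO count grows linearly with the number of iterations even when the data itself is tiny.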

Interactive Operations on MapReduce
In interactive operations, the user runs ad-hoc queries on the same subset of data, and each query goes to disk and performs IO to fetch the data before returning results to the user. This increases the interactive query time and hurts the user experience. We will see below how this works in MapReduce.

Data Sharing using Spark RDD
We saw why data sharing between intermediate steps is slow in MapReduce for both iterative and interactive applications. The cause is the disk IO, serialization, and replication that are built into, and core to, the functioning of Hadoop MapReduce. Spark was developed to overcome this slowness. It is built on the concept of Resilient Distributed Datasets, which are essentially in-memory objects that are partitioned and distributed across a cluster and are also fault tolerant. Intermediate stage outputs therefore need not be written to stable storage and can be accessed from memory, avoiding the biggest bottleneck of MapReduce systems. This in-memory sharing is what can make Spark 10 to 100 times faster than MapReduce.
Let us now see how iterative and interactive operations work with Spark RDDs.
Iterative Operations on Spark RDD
The diagram below shows how Spark RDDs work in iterative applications. Intermediate results are kept in memory instead of being written to stable disk storage, and subsequent steps read the same in-memory RDD objects. Only when memory (RAM) is insufficient to hold the whole RDD are results spilled to disk. Even so, the system as a whole is much faster than the equivalent MapReduce application because of this optimization.

Interactive Operations on Spark RDD
The diagram below shows how interactive systems benefit from Spark RDD processing. If different queries are run on the same set or subset of data, that data can be kept in memory for faster response times. By default, a transformed RDD may be recomputed each time an action is run on it, but this can be avoided with Spark's caching mechanism, which keeps the computed RDD in memory, distributed across the machines of the cluster.
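The difference between recomputing on every action and reusing a cached result can be shown with a toy lazily evaluated dataset. This is a sketch in plain Python on a single machine, not the Spark API: `LazyDataset`, its `evaluations` counter, and the `total` action are hypothetical names used only for illustration.

```python
# Toy lazily evaluated dataset. NOT the Spark API: all names are hypothetical.
class LazyDataset:
    def __init__(self, compute):
        self._compute = compute       # zero-argument function producing a list
        self._cache_enabled = False
        self._cached = None
        self.evaluations = 0          # how many times this dataset was computed

    def map(self, f):
        # Transformation: nothing runs yet, we only record what to do.
        return LazyDataset(lambda: [f(x) for x in self._evaluate()])

    def cache(self):
        """Ask for the result to be kept in memory after the first evaluation."""
        self._cache_enabled = True
        return self

    def _evaluate(self):
        if self._cached is not None:
            return self._cached       # served from memory, no recomputation
        self.evaluations += 1
        result = self._compute()
        if self._cache_enabled:
            self._cached = result
        return result

    # Actions: each one triggers evaluation of the recorded transformations.
    def count(self):
        return len(self._evaluate())

    def total(self):
        return sum(self._evaluate())


base = LazyDataset(lambda: list(range(1, 6)))

uncached = base.map(lambda x: x * 10)
uncached.count()
uncached.total()    # second action: the map is computed again

cached = base.map(lambda x: x * 10).cache()
cached.count()
cached.total()      # second action: answered from the cached result
```

Here `uncached.evaluations` ends up at 2, one per action, while `cached.evaluations` stays at 1 no matter how many further queries are run against it.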

In this section we saw how RDDs make Apache Spark a fast, fault-tolerant, distributed processing engine.