
Spark SQL and its interfaces, DataFrames and Datasets, are the future of Spark performance. They are the most important features for getting the best performance out of Spark on structured data, because they use more efficient storage options and query optimizers.
The SQL engine was introduced in Spark 1.0, DataFrames in Spark 1.3, and Datasets in Spark 1.6.
Spark developers now use DataFrames/Datasets for most data processing.
DataFrames and Datasets are higher-level APIs that internally use RDDs. Even though we can do everything we need with RDDs alone, the higher-level APIs let us become productive with Spark much faster, especially for those coming from an RDBMS and SQL background.
Spark SQL is the module for structured data processing, with the added benefit of a schema for the data, which we did not have with RDDs. A schema gives Spark more information about the data it is processing, so it can perform more optimizations during execution. We can also work with the data using interactive SQL queries that adhere to the ANSI SQL:2003 standard, and Spark SQL is compatible with Hive.
The other option for querying and processing data is the DataFrame. A DataFrame is a distributed collection of Row objects; a Row holds the data and lets us access each column. A DataFrame can therefore be thought of as a database table, with the data organized in rows and columns. In Spark 2.0 the higher-level APIs were unified under Dataset: a DataFrame is simply a Dataset of Rows, i.e. Dataset[Row]. DataFrames can be converted to RDDs and back as and when required.
Querying DataFrames/Datasets is very easy. It is done using a Domain-Specific Language (DSL) that is relational in nature, which gives Spark room to optimize the queries.
The diagram below shows the steps in query execution for Spark SQL/DataFrames/Datasets.

When a query is executed, it is first parsed into an unresolved logical plan, meaning the plan still contains unresolved attributes and relations. Spark then looks into the catalog to fill in the missing information, producing a resolved logical plan. A series of optimizations is applied next, producing an optimized logical plan; the optimization engine that does this is called the Catalyst optimizer. The optimized plan is then converted into multiple physical plans, and a cost model is used to select the optimal physical plan. Finally, the code generation step produces the code that is executed to compute the final output as RDDs.
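We can watch Spark go through these stages for any query by calling explain(true); a minimal sketch, where df stands in for any DataFrame:

```scala
// explain(true) prints the parsed (unresolved) logical plan, the analyzed
// logical plan, the Catalyst-optimized logical plan, and the chosen physical plan.
df.explain(true)
```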
Let's look at how we can create DataFrames/Datasets and execute Spark SQL. We have already seen how a SparkContext is created. Spark 2.0 introduced SparkSession, a simplified entry point for Spark applications that encapsulates the SparkContext. Earlier versions of Spark had different contexts for different use cases, such as SQLContext, HiveContext, and SparkContext; all of these are now unified in SparkSession, which removes any confusion about which context to use.
Another benefit of SparkSession is that, unlike SparkContext, we can create multiple SparkSessions when needed, as shown below.

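A minimal sketch of this (the application name and master URL are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SparkSQLBasics")  // illustrative application name
  .master("local[*]")         // run locally, using all available cores
  .getOrCreate()

// newSession() gives a second session with its own SQL configuration and
// temporary views, while sharing the same underlying SparkContext.
val spark2 = spark.newSession()
```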
Let's look at how we can create a DataFrame and query the data. DataFrames can be created by loading data from external sources such as the local filesystem, HDFS, S3, an RDBMS, HBase, etc. They can also be created from existing DataFrames by applying transformations. For simplicity, we will create a DataFrame from a local collection, as below:

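For instance, a small DataFrame can be built from an in-memory collection; the (id, country) pairs below are made up for illustration:

```scala
// Brings in the toDF()/toDS() implicit conversions for local collections
import spark.implicits._

val listDF = Seq((1, "India"), (2, "USA"), (3, "Japan")).toDF()
```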
We can collect() the RDD underlying our DataFrame, but that does not show the data the way the DataFrame intends. Instead, we call .show(), which gives us a nice tabular view of our data. We can also check the schema using .schema or .printSchema().
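Continuing the sketch above:

```scala
listDF.show()         // renders the rows as a small tabular view
listDF.printSchema()  // prints the column names and their inferred types
```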
If we type listDF and press Tab, we can see all the available methods.

Also, since we did not provide any column names for our DataFrame, we see the defaults _1 and _2. We can assign proper names using the toDF() function.

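A sketch of the renaming, producing the listDF2 DataFrame that the queries below assume (the names Id and Name are our choice):

```scala
// Replace the default _1, _2 column names with meaningful ones
val listDF2 = listDF.toDF("Id", "Name")
listDF2.printSchema()
```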
We can query a DataFrame much as we would query a table using SQL, via Spark's Domain-Specific Language. The table below shows the correspondence.
| SQL | Spark DSL |
| --- | --- |
| SELECT * FROM Country | listDF2.show() |
| SELECT Id FROM Country | listDF2.select("Id").show() |
| SELECT * FROM Country WHERE Id = 1 | listDF2.select("*").where(col("Id") === 1).show() |
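Note that the col function used in the DSL comes from org.apache.spark.sql.functions; with that import in place, the queries above run as-is:

```scala
import org.apache.spark.sql.functions.col

listDF2.select("Id").show()
listDF2.select("*").where(col("Id") === 1).show()
```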

The DataFrame can also be saved to a filesystem, HDFS, S3, etc. Here we simply save it to the local file system.


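A minimal sketch, writing CSV to an illustrative local path:

```scala
// Write the DataFrame as CSV with a header row; the output path is illustrative
listDF2.write
  .option("header", "true")
  .mode("overwrite")
  .csv("/tmp/country-data")
```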
Working with Spark SQL is very similar to working with DataFrames, with the advantage that we can use familiar ANSI SQL queries instead of the DSL, which is convenient and greatly reduces the learning curve. To do this, we just register our DataFrame as a temporary table or view, after which we can run any SQL query against it. Let's see how below:

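A sketch using the listDF2 DataFrame from earlier, registered under the view name Country:

```scala
// Register the DataFrame as a temporary view, then query it with plain SQL
listDF2.createOrReplaceTempView("Country")

spark.sql("SELECT * FROM Country").show()
spark.sql("SELECT Id FROM Country WHERE Id = 1").show()
```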
Let's move on to Datasets. Datasets are very similar to DataFrames, with the distinction that they are strongly typed collections of objects, so they are type-safe. This helps us catch errors at compile time, such as column-reference mistakes that DataFrames only surface at runtime. Otherwise, processing and querying Datasets works almost exactly as we have seen for DataFrames.
Creating Dataset

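A sketch using a hypothetical Country case class; toDS() again relies on import spark.implicits._:

```scala
// The case class defines both the schema and the compile-time element type
case class Country(Id: Int, Name: String)

val countryDS = Seq(Country(1, "India"), Country(2, "USA")).toDS()
countryDS.show()
```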
Creating Dataset from DataFrames

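Converting is a one-liner with as[T], assuming the DataFrame's column names and types match the case class fields:

```scala
// Works because listDF2 has columns Id (Int) and Name (String)
val countryDS2 = listDF2.as[Country]
```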
Creating Dataset from RDD

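A sketch building an RDD of Country objects and converting it with toDS():

```scala
val countryRDD = spark.sparkContext.parallelize(
  Seq(Country(1, "India"), Country(2, "USA")))

val countryDS3 = countryRDD.toDS()  // toDS() on RDDs also comes from spark.implicits._
countryDS3.show()
```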
Now that we have covered all three APIs Spark provides, i.e. RDD, DataFrame, and Dataset, we should understand when to use each and how each performs.
RDD: We should use RDDs in the following use cases:
- When we need low-level transformations and actions and fine-grained control over how the data is processed.
- When the data is unstructured, such as media or free-form text, where a schema adds little value.
- When we prefer functional-style manipulation of objects over domain-specific or SQL-like expressions.
- When we can accept forgoing the optimizations that the structured APIs provide.
DataFrame: We should use DataFrames in the following use cases:
- When the data is structured or semi-structured and we want high-level relational operations such as filters, aggregations, and joins.
- When we want Spark to optimize execution for us through the Catalyst optimizer and Tungsten's efficient code generation.
- When we want a consistent API across Scala, Java, Python, and R.
We should remember that DataFrames and Datasets have been a unified API since Spark 2.0, so most functionality is now available in both.
Dataset: We should go with Datasets for the following reasons:
- When we want compile-time type safety for our transformations.
- When we want to work with typed JVM objects (such as case classes) while still benefiting from Catalyst optimizations.
- Keep in mind that the typed Dataset API is available only in Scala and Java.
We can summarize the three APIs on the basis of performance as follows: for structured data, DataFrames and Datasets are generally far more efficient than RDDs, because the Catalyst optimizer improves the query plan and Tungsten generates compact, efficient code. Typed Dataset operations that take arbitrary lambdas are opaque to the optimizer, so they can be somewhat slower than the equivalent DataFrame expressions, which is the price paid for compile-time type safety.
In this section, we looked at the higher-level APIs and understood when and how to use them. We also compared their performance, which gives us a clear picture of their usage.