Data science has become one of the most promising careers today. Many experienced professionals from different fields look to transition into a data science role, while fresh graduates aspire to land their first break in the world of data science. Since almost all data science roles expect a certain level of programming skill, it becomes essential to build familiarity with a specific tool along with the data science fundamentals. To get started, a data science bootcamp can provide the focused coaching required for a data science track. There are three popular programming languages used in data science: Python, R, and SAS (Statistical Analysis System). Most data professionals, academicians, and startups prefer the open-source tools Python and R. SAS, on the other hand, is a roughly 50-year-old proprietary data science tool catering to the industry's demands. Let us explore more about SAS as a tool, SAS programming, and SAS certification in this article.
What is SAS?
The acronym SAS stands for Statistical Analysis System, a tool offered by SAS Institute Inc., headquartered in Cary, North Carolina (USA). It is a commercial, closed-source, integrated system of software products designed for advanced analytics and the complicated statistical processes required in Business Intelligence. Large organizations and experts employ SAS for their data science projects due to its high reliability.
SAS performs statistical modelling using Base SAS, which is the primary programming language powering the SAS environment. This tool provides a wide range of statistical capabilities for sophisticated modelling. Despite the availability of competing open-source technologies, SAS remains favoured by companies. Since SAS is a commercial tool geared towards industry demands, it is usually not a tool used by beginners or independent data science enthusiasts. Although SAS is an expensive tool, independent and academic learners as well as tutors can benefit from exploring SAS Studio in the cloud for free.
SAS enables users to retrieve, report, and analyse statistical data. It is an effective tool for running SQL queries and automating user tasks with macros. Aside from that, users can also generate descriptive visualizations through graphs, and additional SAS products provide reporting on machine learning, data mining, time series, and so on.
Some of the areas where programmers can use SAS are -
- Business planning, Forecasting, and Decision Support
- Report writing and Graphics
- Information Retrieval and Data Management
- Operations Research and Project Management
- Quality Improvement
- Applications development
- Statistical Analysis, Econometrics and Data Mining
- Data Warehousing (Extract-Transform-Load)
How Different is SAS from Python and R?
SAS has its own programming language that is similar to SQL (but not identical), and it uses a graphical user interface (GUI). Although GUIs are available for Python and R, SAS has a built-in GUI.
A few notable features of SAS products are –
- It can process millions of rows and thousands of columns i.e., it is scalable
- It consists of built-in statistical and random number functions
- It has comprehensive date and time handling functions
- It also has functions for character and number manipulation
- It can interact with databases, operating systems and can provide output in different formats like CSV, PDF, and XML.
Thus, SAS offers capabilities comparable to those of Python and R for performing data science tasks and building large-scale big data solutions. These solutions can be used in Business Intelligence, IT management, Human Resource Management, Financial management, Customer Relationship management and more.
Hence, the choice of the tool boils down to personal or industry-specific preferences and/or requirements. So, if you are a beginner interested in pursuing SAS for Data Science, read on to learn more.
Getting Started with SAS Essentials
In this section, we will cover the SAS basics required to get started. SAS Studio is a clean, browser-based interface for writing and running SAS programs; on a desktop installation, the equivalent is the SAS windowing environment. To write a SAS program in the SAS system (available on the SAS cloud), we will be using this interface. The first thing to do is to open SAS Studio; then let us look at the components of its user interface.
The SAS Studio program window is composed of two panes: one for navigation and the other pane with a menu and multiple tabs for SAS programming.
Before we begin with SAS programming, let us look at the SAS syntax first.
The SAS syntax is quite different from Python or R. Prior knowledge of SQL syntax is helpful in understanding the SAS syntax. Here are some key features and rules of SAS syntax:
- SAS statements are similar to sentences that end with a semicolon (;).
- Most (but not all) statements begin with a keyword such as proc, data, label, options, format, etc.
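- Statements are organized into paragraph-like chunks called steps. A step is typically terminated with the word "run" and a semicolon; a few procedures, such as PROC SQL, are terminated with "quit;" instead.
- SAS Comments: Type /* to begin a comment and */ to end it. A comment can also be written as a statement that begins with an asterisk (*) and ends with a semicolon.
- SAS statements are free-format. One or more blanks or special characters are used to separate words.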
- SAS statements can begin and end in any column.
- A single statement can span multiple lines, while several statements can be on the same line.
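The rules above can be seen in a short program. This is only an illustrative sketch: the data set name work.demo and its variables are made up for the example, and the program creates a one-row data set and prints it.

```sas
/* A block comment starts with slash-star and ends with star-slash */
* A statement-style comment starts with an asterisk and ends with a semicolon;

data work.demo;          /* the DATA keyword starts a DATA step      */
   x = 1; y = 2;         /* several statements on the same line      */
   total =
      x + y;             /* a single statement spanning two lines    */
run;                     /* RUN terminates the step                  */

proc print data=work.demo;   /* the PROC keyword starts a PROC step  */
run;
```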
A SAS program is built from two types of steps: DATA steps and PROC (procedure) steps. We will explore the basics of SAS in this article through an exploratory data analysis using the SAS environment.
DATA Step: A DATA step begins with a DATA statement and runs until the next RUN statement (or the start of the next step). When the raw data is typed directly into the program, it follows a DATALINES statement at the end of the step. In this step, we can define and modify the values in the relevant dataset. We use different SAS statements for reading, cleaning, and manipulating the data in the DATA step prior to analyzing it. The raw data gets transformed into a SAS dataset during the DATA step. The keywords "cards" and "datalines" are interchangeable. With the DATA step, we can read in data, create or modify variables, and prepare the dataset for reporting and descriptive analysis.
In the DATA step, four statements are often used. These are
- ‘DATA’ to identify the dataset.
- ‘INPUT’ to list the variable names.
- ‘CARDS’ (or its alias ‘DATALINES’) to indicate that data lines will follow immediately.
- ‘INFILE’ to indicate that data is stored in an external file and to specify the name of that file.
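The four statements can be sketched in two small DATA steps. The data set names, variable names, data values, and the file path below are all hypothetical and only serve to show where each statement goes.

```sas
/* In-stream data: DATA names the data set, INPUT lists the variables,
   and DATALINES (alias CARDS) marks the start of the data lines.      */
data work.cars_small;
   input model $ mpg cylinders;   /* $ marks model as a character variable */
   datalines;
civic 33 4
mustang 19 8
golf 29 4
;
run;

/* External file: INFILE points at the file instead of DATALINES.
   The path is a placeholder; DLM sets the delimiter and FIRSTOBS=2
   skips an assumed header row.                                        */
data work.cars_file;
   infile '/home/your-user-id/cars.txt' dlm=',' firstobs=2;
   input model $ mpg cylinders;
run;
```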
The SAS program accepts various types of data inputs, like-
- existing SAS data sets (which are either SAS data sets or SAS views),
- raw data (unprocessed data that has not been previously read in a SAS data set) read from external files or streaming data.
- SAS library
- Remote access for data sources such as Azure, SAS catalogue, Hadoop, S3, zip and more.
PROC Step: The PROC step instructs SAS on what analysis to perform on the data, such as regression, analysis of variance, mean computation, and so on. Every PROC step begins with the keyword "PROC."
PROCs can be used to evaluate data in a SAS data set, generate formatted reports or other outputs, or provide methods for managing SAS files. PROCs can easily be modified with options to create the desired output. PROCs can also present information about a SAS data set itself.
We’ll cover a few statements in this article. For additional information on SAS programming, refer to the official SAS documentation.
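As a small illustration of these three uses, the sketch below runs against sashelp.class, a sample data set that ships with SAS. The output name work.class_sorted is an arbitrary choice for the example.

```sas
/* Present information about a data set: variables, types, lengths */
proc contents data=sashelp.class;
run;

/* Manage data: write a copy sorted by age into the WORK library */
proc sort data=sashelp.class out=work.class_sorted;
   by age;
run;

/* Report: print selected columns of the sorted copy */
proc print data=work.class_sorted noobs;
   var name age height weight;
run;
```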
Now, let us read a sample dataset using SAS programming. For demonstration, we are using a dataset called ‘auto-mpg’ in CSV format which contains features related to city-cycle fuel consumption in miles per gallon.
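One common way to read a CSV file is PROC IMPORT. The sketch below is not the article's original code: the file path is a placeholder for wherever the CSV was uploaded, and the output name work.auto_mpg is an assumption (a hyphen is not allowed in a SAS data set name, so the underscore replaces it).

```sas
/* Read the CSV into a SAS data set in the WORK library.
   The path is hypothetical; adjust it to your own SAS Studio folder. */
proc import datafile='/home/your-user-id/auto-mpg.csv'
            out=work.auto_mpg
            dbms=csv
            replace;          /* overwrite work.auto_mpg if it exists */
   getnames=yes;              /* first row holds the column names     */
   guessingrows=max;          /* scan all rows to infer column types  */
run;
```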
We will now print the first ten rows of this dataset to get an idea about the dataset features using PROC statement.
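A minimal sketch of this step, assuming the imported data set is named work.auto_mpg, uses the OBS= data set option to limit the listing to ten rows:

```sas
/* Print only the first 10 observations of the assumed data set */
proc print data=work.auto_mpg (obs=10);
run;
```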
The output from the above code shows the first few rows from the imported dataset.
Applying Data Cleaning Techniques
Most of the time, the imported dataset will have issues like missing values, outliers, duplicate and/or redundant entries, as well as skewness in some of the features. If such a dataset is used for training a model, then the predictions from the model could be highly inaccurate. Hence, data cleaning is an important step in the data science project workflow.
It is important to be familiar with project requirements in order to identify which data values could be invalid. For example, the requirements may state that a variable must hold unique, non-missing values, or that a feature must fall within a given range of values. We can detect invalid data using several SAS procedures such as PROC PRINT, PROC FREQ, PROC MEANS, and PROC UNIVARIATE.
Once we have identified the flaws in the dataset, the next step would be to clean it. Ideally, this is done while reading the data, which prevents incorrect data from being saved in a SAS data set. If a data set still needs cleaning after it has been stored as a SAS data set, this can be done using the VIEWTABLE window or programmatically using the DATA step, PROC SQL, or PROC SORT.
Let us explore this with an example. Say we have imported a healthcare dataset which is likely to have some missing values. To check the count of missing values, we can use PROC MEANS with the NMISS statistic keyword as follows -
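A sketch of such a check is shown here. The data set name work.healthcare is an assumption, and note that PROC MEANS only reports on numeric variables.

```sas
/* N = count of non-missing values, NMISS = count of missing values,
   reported for every numeric variable in the assumed data set       */
proc means data=work.healthcare n nmiss;
run;
```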
SAS prints a nice table with the features and the missing values.
Similarly, we can use PROC FREQ to identify invalid values in this dataset. For example, ‘age’ values cannot be zero for any healthcare dataset. So, such values are considered invalid. We can print these using -
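One possible way to list these entries is sketched below. The data set name work.healthcare and the variable name gender are assumptions; only age is named in the text.

```sas
/* Count the observations with an invalid age of zero, split by gender */
proc freq data=work.healthcare;
   where age = 0;            /* keep only the suspicious rows        */
   tables gender / nocum;    /* frequency per gender, no cumulatives */
run;
```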
We can see that there are 20 entries with age=0 (7 female and 13 male).
Next, to clean this dataset, let us replace the missing values with zeros. For this, we can use the following code -
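A minimal sketch of this replacement uses a DATA step array over all numeric variables. The input and output data set names are assumptions, and character variables are left untouched.

```sas
data work.healthcare_clean;
   set work.healthcare;                 /* assumed input data set       */
   array nums {*} _numeric_;            /* every numeric variable so far */
   do i = 1 to dim(nums);
      if missing(nums{i}) then nums{i} = 0;   /* missing becomes zero   */
   end;
   drop i;                              /* remove the loop counter      */
run;
```

Because the array is declared before the counter i is created, i itself is not part of the array.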
The output data from the SAS program shows that all missing values have been replaced in the dataset with zero values, and there are no more missing values. Similarly, we can choose and use other functions in SAS to address the specific issues in the dataset.
Using SQL for Data Creation and Query
The SQL procedure is the Structured Query Language implementation in Base SAS. PROC SQL is included with Base SAS software and can be used with any SAS data set (table). PROC SQL is frequently used as an alternative to other SAS procedures or the DATA step. Using PROC SQL, we can create reports, display summary statistics, retrieve data from tables or views, create and manipulate data, and modify a table by adding, updating, or deleting columns.
The syntax of PROC SQL includes statements and clauses like -
- PROC SQL: calls the SAS SQL procedure
- SELECT: specifies the column(s) (variables) to be selected
- FROM: specifies the table(s) (data sets) to be queried
- WHERE: filters data based on a condition
- GROUP BY: categorizes the data into groups based on the specified column(s)
- ORDER BY: sorts the resulting rows by the specified column(s)
- QUIT: ends the PROC SQL procedure.
By default, the results of a query are displayed in the SAS output window. Here is an example:
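The query below is an illustrative stand-in rather than the article's original query. It assumes a data set work.auto_mpg with numeric columns mpg and cylinders, and it exercises each clause from the list above.

```sas
proc sql;
   select cylinders,
          count(*) as n_cars,
          avg(mpg) as avg_mpg format=8.2
   from work.auto_mpg
   where mpg is not missing          /* filter rows before grouping   */
   group by cylinders                /* one result row per cylinder count */
   order by avg_mpg desc;            /* highest average mpg first     */
quit;
```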
The above query results in the following output.
With this overview of SAS syntax, programming, and SAS SQL statements, let us look at reporting and how to build visualizations with SAS in the next section.
Reporting, Statistical, and Visual Analysis
Exploratory Data Analysis on a dataset always consists of relevant visualizations, tabulated information, and statistics describing the significant entities in a dataset. A well performed EDA can help data science professionals uncover the underlying trends and patterns in a data set. With SAS, we can build appropriate visualizations and display relevant information about the data set using the PROC statements.
For example, let us print the descriptive stats for the previously imported mpg dataset using the following-
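One way to do this is with PROC MEANS, sketched here under the assumption that the data set is named work.auto_mpg and that mpg, weight, and acceleration were imported as numeric columns.

```sas
/* Count, mean, median, minimum, and maximum, rounded to 2 decimals */
proc means data=work.auto_mpg n mean median min max maxdec=2;
   var mpg weight acceleration;
run;
```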
This PROC statement for tabulating the data provides the following output.
Another example: We can print the standard deviation of one or more features in the dataset using the following -
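A sketch of this, again assuming a data set work.auto_mpg with numeric mpg and acceleration columns, requests only the STD statistic:

```sas
/* Standard deviation of the selected variables, 3 decimal places */
proc means data=work.auto_mpg std maxdec=3;
   var mpg acceleration;
run;
```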
Similarly, we can build some frequently used visualizations using SAS.
1. Box plot
A box plot provides a good indication of the distribution of data i.e., how many observations lie in a particular range and which ones are the clear outliers. Using SAS, we can create a box plot that shows the outliers in a data set as tiny circles.
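A box plot along these lines can be sketched with PROC SGPLOT. The data set work.auto_mpg and the grouping variable cylinders are assumptions; dropping the CATEGORY= option gives a single box for mpg.

```sas
/* One vertical box per cylinder count; outliers appear as markers */
proc sgplot data=work.auto_mpg;
   vbox mpg / category=cylinders;
run;
```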
2. Histogram
A histogram summarizes discrete or continuous data measured on an interval scale by grouping the values into bins. The code below generates a histogram for the mpg (miles per gallon) variable.
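A minimal sketch, assuming the data set work.auto_mpg, uses the HISTOGRAM statement of PROC SGPLOT. The DENSITY statement is an optional extra that overlays a kernel density curve.

```sas
proc sgplot data=work.auto_mpg;
   histogram mpg;                  /* binned distribution of mpg    */
   density mpg / type=kernel;      /* optional smoothed overlay     */
run;
```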
3. Scatter plot
A scatter plot displays the relationship between two numerical variables. The following sample SAS code generates a scatter plot which shows the relationship between the ‘acceleration’ and ‘mpg’ variables.
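A sketch with PROC SGPLOT, assuming the data set is named work.auto_mpg:

```sas
/* acceleration on the horizontal axis, mpg on the vertical axis */
proc sgplot data=work.auto_mpg;
   scatter x=acceleration y=mpg;
run;
```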
4. Bar chart
A bar chart is a plot with bars used for comparing different categories. To generate a bar chart in SAS, we can use the following code -
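A minimal sketch with PROC SGPLOT is shown here. The data set work.auto_mpg and the category variable cylinders are assumptions; by default each bar shows the number of observations in that category.

```sas
/* One vertical bar per cylinder count; bar height = frequency */
proc sgplot data=work.auto_mpg;
   vbar cylinders;
run;
```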
5. Stacked Bar chart
A stacked bar chart is a type of bar chart which represents the proportional contribution of individual data points compared to the total. The height of each bar represents the contribution of each group to the total. We can generate a stacked bar chart in SAS using -
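A sketch of a stacked bar chart, assuming the data set work.auto_mpg with a category variable cylinders and a grouping variable origin (both hypothetical here):

```sas
/* Each bar is split into segments, one per value of origin */
proc sgplot data=work.auto_mpg;
   vbar cylinders / group=origin groupdisplay=stack;
run;
```

Changing GROUPDISPLAY=STACK to GROUPDISPLAY=CLUSTER places the segments side by side instead.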
SQL, Python, and R are the key tools used by data scientists for accessing, cleaning, and analyzing data, as well as developing predictive models. However, some organizations use alternative analytical tools, such as SAS, owing to their specific industry requirements, such as in healthcare, where Clinical SAS is used.
As a result, SAS may not be treated as a standard part of data science degree programs. As a data scientist, it is essential to first master basic mathematics and statistics and then move on to mastering the conventional languages - SQL, Python, and R. So, the question arises, ‘Should I learn SAS and get certified in SAS programming?’ Organizations that use SAS tend to prefer candidates who possess a SAS certification, since certified candidates are generally assumed to have studied the contents and capabilities of SAS for practical applications. Hence, a SAS certification is advisable if you are a Data Science aspirant planning to work with SAS. There are mainly two types of certifications, namely Base SAS and Advanced SAS.
One can learn SAS through self-study or through online courses offered by various academic platforms. Self-study tends to make the learning process longer, while online courses make it more expensive. Additionally, one should be consistent enough to put in 3–4 hours a week towards learning the tool. It is also possible to engage a private tutor who is an experienced SAS practitioner, as one-to-one interaction is beneficial in preparing for the exam. SAS’s free resources, books, and technical documentation, along with a few YouTube SAS tutorials, are a good place to start learning.
SAS offers multiple credentials, including a highly specialized certification in Clinical SAS; details can be found on the SAS certification website. The exams are conducted by Pearson VUE. All these exams cost $180 for each attempt (discounts might be possible for academic users), and multiple attempts are allowed until you clear the exam with 80% marks.
Preparing for SAS Base Certification Exams
One of the offered credentials from SAS is a ‘SAS Certified Specialist: Base Programming Using SAS 9.4’. According to SAS, having cleared this exam indicates that the user can -
- Read and create data files
- Create basic detail and summary reports using Base SAS procedures
- Manipulate and transform data
- Identify and correct syntax and programming logic errors.
This online exam is based on SAS 9.4 M5 and consists of 40-45 multiple-choice questions (MCQs) and short-answer questions. These need to be answered within 135 minutes, and the passing score is 725 (scores range from 200 to 1000 points). To prepare for the exam, SAS offers an eBook, some practice tests, and sample questions that help candidates become familiar with the exam pattern.
Preparing for SAS Advanced Certification Exams
An advanced credential offered by SAS for programming is ‘SAS Certified Professional: Advanced Programming Using SAS 9.4’. This advanced exam is intended for users who can demonstrate their capabilities in writing and executing SAS code during the exam. It is a performance-based exam for candidates who have already earned their Base programming credential. Aspirants looking to gain this certification need to prepare for the following things.
- Use advanced DATA step programming statements to solve complex problems
- Write and interpret SAS SQL code
- Create and use SAS macros
- Access a SAS environment to work with SQL, the SAS Macro facility, and advanced coding techniques such as arrays, hash objects, and PROC FCMP
- Face coding challenges that require writing and executing SAS code to solve them.
The code and results are assessed by a scoring macro that will determine if the candidate has solved the challenge correctly or not.
Similar to the Base Programming exam, this exam is also administered by SAS and Pearson VUE. There are 10-15 programming projects and 10-15 standard exam questions, and the programming projects are assessed by a SAS scoring macro. The passing score is 725 (scores range from 200 to 1000 points). The exam duration is 135 minutes, and the exam is based on SAS 9.4 M5.
In this article, we explored SAS as a tool for data science and how to approach learning this tool as an absolute beginner.
To begin your learning journey with SAS, you'll first need to build your basics in Data Science. Learning a tool does require some conceptual understanding as well as some practice. Once you're familiar with the concepts of data science, such as data cleaning, data exploration, data storage, and data analysis, along with the basics of SQL, you will be able to pick up the tool faster.
Additionally, there is no prior programming experience required for SAS due to its simple and easy-to-use GUI.
As of now, there is still a great demand for SAS professionals, and hence, earning a SAS certification could prove a good career choice. There are two different levels of certification available, and beginners should be able to earn the Base Programming certificate offered by SAS on their website.
Although Python has become the popular choice of many professionals today since it is an open-source tool, SAS is nevertheless a powerful and well-established commercial platform that is still being used by a lot of companies for their data science projects. Hence, it could be worth investing time for beginners desiring to pursue data science with SAS.