For enquiries call:

Phone

+1-469-442-0620

Aage ki Socho

HomeBlogData ScienceMaster SQL for Data Science [Beginners Guide]

Master SQL for Data Science [Beginners Guide]

Published
05th Sep, 2023
Views
view count loader
Read it in
11 Mins
In this article
    Master SQL for Data Science  [Beginners Guide]

    While learning a new subject, there have always been two major approaches that most of us take up. First is the Classic method of systematically learning all the concepts by going step-by-step from basic to advance levels. The other one is trying to learn everything one night before the interview or an exam. For Sequel data analytics, we may have seen many instances where our colleagues start rote learning to get a hang of the topics and concepts. This method might get you through your interview, but in the long run and in your daily work you will be back to struggling and going around googling the same stuff repeatedly to perform simple tasks.

    The road to mastering SQL programming specifically for data science, which many employers still seek as the most in-demand skill, is the topic we will be discussing in this article. If we were to look at the skill trends across the past few years, we will find SQL was always a part of the top 10 programming languages.  To polish your Data Science skills, you can opt for Data Science and Machine Learning online course.

    In demand Programming languages

    What is SQL (Structured Query Language)? 

    SQL or Structured Query Language or ess-cue-ell or even SEQUEL, let us first try to understand what SQL is and how it works. Also, know that Sequel programming languages is a popular term.

    Structured Query Language (SQL) is a standardized programming language that is used to manage relational databases and perform various operations on the data that is present on them.

    With SQL you will be able to create, insert, alter, update and filter data to any database format that you will see fit or optimum for further usage. Here, you will find new terminologies to determine the applications of SQL based on the set of commands or statements used:

    1.  Data Definition Language (DDL): The commands that are a part of DDL are the ones that will be used for defining a database schema. DDL consists of commands that are used to create, modify and delete database structures. These commands are generally used by the Database admins and lesser by the general users. Ex: CREATE, ALTER, DROP, TRUNCATE .
    2.  Data Query Language (DQL): These statements used within DQL are used for querying the data from the schema. DQL includes the SELECT command. This command allows getting data from the database to perform operations with it.
    3.  Data Manipulation Language (DML): This section of SQL consists the most number of SQL commands you will use to manipulate the tables that are either present in the database or a table created by yourself using DDL and the data within them. Ex: INSERT, UPDATE, DELETE, LOCK, etc. Tables denote that related data is held in the form of rows and columns.
    4.  Platforms for SQL: We can also find that platforms like Hadoop and Spark which are popular services for handling and working with Big Data, provide an extension for querying and manipulating using DDL, DQL and DML commands of SQL.
    5.  Data Control Language(DCL): Commands included in this section are GRANT and REVOKE which are essentially used for providing rights, permissions and other controls within the database systems. Mainly the database admins would be dealing with DCLs.

    The front-end to the back-end connection is established through the Client/Server language, which is SQL and is extensively used in the Data Science industry.

    Why SQL for Data Science?

    Reiterating the above point of SQL for Data Science being the most in-demand skill and relevant even after four decades of its existence, tells us a lot about how it offers several key features and advantages over the alternatives. It can also be noted how almost all the Data Science job postings that we see today on different platforms list SQL as one of the primary skills. In another survey from the 2020 LinkedIn top 10 Startups from India list, 7 have SQL listed as one of their Most Common Skill.

    Need for SQL in Data Science 

    The following points highlight the need for SQL in Data Science:

    • We can define Data Science as simply an all-around study of the data itself. To enable this, the first step is to extract this data from the databases where SQL comes into the picture. Most of the companies we see today find Relational Database management as a crucial part of their data management ecosystem. SQL allows us to perform all the actions on these databases such as defining, manipulating, creating, querying the databases and many more.
    • We can also find that platforms like Hadoop and Spark which are popular services for handling and working with Big Data, provide an extension for querying and manipulating using DDL, DQL and DML commands of SQL.
    • For all the data stored in Oracle, PostgreSQL and Microsoft SQL Server which are examples of a few popular relational databases used by large companies, we would need SQL as well.
    • SQL also becomes a necessary tool for carrying out data preparation and wrangling actions. Therefore, to deal with various Big Data tools, we will be making use of SQL.

    Key Elements of SQL for Data Science 

    Key Elements of SQL
    These are a few concepts that we believe will be essential for using SQL within Data Science. The topics mentioned here will help you focus on areas that are crucial to know. We will be looking into these topics as we keep going further into the article. 

    Using SQL for Data Science 

    Using SQL for Data Science

    1. Data Understanding

    One of the most crucial aspects of being able to solve a problem in any domain/industry of Data Science is to understand the data to the fullest. In most companies, there is an “80/20” rule that is followed whereas it is seen that 80% of a data scientist’s valuable time is spent simply finding, organizing, and cleansing the data, and only 20 percent to performing actual analysis on the prepared data. 

    The process or steps involved in data understanding can be as follows:

    A) Creating a Data Dictionary

    This step simply involves noting down what each column is trying to tell us through the range of values it consists. We can even get our data dictionary validated by the clients.

    B) Looking out for Data Discrepancies

    Here, we can identify the columns with missing values, NAs and outliers. This helps us in cleaning the data as well.

    C) Getting Statistics of Numerical Columns

    Through simple functions, we can get the distribution of numerical columns in the tables. We can look at the mean, median, deviation and variation.

    D) Using Visualisations

    Sometimes it gets tricky to understand a few columns as they might have a wide range spread across large values. Graphs and visualisations come in handy at this point of time.

    2. Business Understanding

    When we mention that we need to conduct multiple sessions with the business/stakeholder to understand the data, there is also another aspect that we need to consider i.e., the business understanding. The understanding of data is incomplete without the business aspect, as we will be missing out on crucial contexts that we can acquire through the business.

    Here, let us look into an example of a business problem to find the annual revenue, profits and losses and the year-on-year growth of stores across the country.

    The problem, even though seemed straightforward, having interactions with the business/stakeholders will give us directionality as to what products to focus on, what stores to put more weightage on, and many such details which enable a data scientist to shape their analysis. The two stages above help us to fill in the blanks within the queries that we will want to work on as we progress with data and business understanding.

    For instance, understanding the core transactions that happen in a business such as the telecom industry can help us interpret the data in a more holistic manner as we can understand factors such as average talk time, churned customers, and loyal customers better. To begin your journey in Data Science, enroll in KnowledgeHut Data Science and Machine Learning online course.

     

    3. Profiling Data

    We can define profiling data as the process of analyzing, examining, reviewing, and summarizing data sets to gain insight into the quality of data.

    • Data Quality can be measured using the factors of accuracy, completeness, consistency, timelines and accessibility of the data. To move any further with the actual analysis of the data, it is necessary to conform to a high quality of the data. A good way to understand this is with the primary key in SQL(A SQL server primary key is any column that possesses unique values for each record. A primary key example in SQL can be an employee_id column in a table of employee details).
       
    • Data Profiling involves reviewing the source data to be able to understand the structure of the data, its content and the interrelationship between them. This process of data profiling benefits any organization two-fold. It helps them to gain a high-level view of the quality of the data and also helps them identify potential data projects for future.

    Once the data profiling step of managing the data with SQL with all its quality checks and ensuring the data is at its highest standard, the next step is to query the data using SELECT and FROM. 

    4. Start with Select 

    Upon completing the above three sections, you will be starting with SELECT. The SELECT command in SQL is an easier place to start your analysis. As the data would be new to look at, one can simply begin by starting with a single table first. Once you start getting comfortable with the analysis, more data and tables can be added to it and you can go back from there. 

    This leads to the next step of looking into the testing and troubleshooting of the queries that we create. 

    5. Test & Troubleshoot 

    This step can also be called the QC step or the Quality Check. Herein, you will be checking the number of records your query will be returning. If the query included all the correct filters to get all the required data points to calculate any statistics, this step becomes crucial as getting results is a simpler task, but making sure it is accurate is of more importance. 

    The next stage is to be able to resolve any queries when we see any discrepancies in the outcomes or the queries themselves. By beginning to dissect the query step by step to identify the source of the issue, we can keep the procedure straightforward. This technique is also useful in that it will enable users to create subsequent queries that are more effective. 

    Code Refactoring, Code Versioning and Documentation 

    We all know the frustrations of having to decode a bug from a badly formatted code. It wastes an ample amount of time to debug it and even then, we are still left with badly written code. 

    A bit of digging into this aspect reveals that most programmers want code that simply works and also wants it with a quick turnaround, which makes formatting the last priority for any one of us. 

    Code Refactoring involves changing the factoring of already-written computer code without altering its external behaviour. Refactoring aims to maintain the functioning of the software while enhancing its design, structure, and/or implementation. 

    An effective method to avoid that messy code is to maintain two versions of your code, one which you keep to yourself for figuring out the code and the other version which is production ready for anyone else to take it up and also understand it easily. One adequate method to follow in SQL is to capitalize the commands in your statement and the rest of it can follow the lower case. Like the example shown below. 

    Code Refactoring, Code Versioning and Documentation

    The next key suggestion will also be to comment on the codes that are written, this will make the code easily readable as the objective is already stated before the code. This can be seen in the example below. This will also ensure documentability on the code which can always be used for later purposes and will also make it easier for transition.

    Code Refactoring, Code Versioning and Documentation

    7. Review

    This forms the last step of the process and one of the most tedious parts of it as well. The entire effort that you put in until now will basically be reviewed from the top to bottom here. This will ensure you have captured all the data understanding and business rules effectively, conformed with the highest quality of data, refactoring, maintaining versions of the code and keeping it well documented will be crucial.

    SQL for Various Data Science Languages

    Until now we have understood how SQL can be used to effectively understand the data by using queries to define and manipulate the data. But what if we were to tell you that SQL can be used within various platforms as well such as R, Python, Hadoop and Spark.

    Also, we need to be wary of the fact that SQL queries within R and python can only tell us about what has happened in the past and cannot predict future. So, their uses in these platforms are only limited to fetching, querying and understanding the data. 

    The general procedure we are going to look at for using SQL within R and Python is to be able to: 

    1. Setting up SQL, 
    2. Creating (from scratch) and fetching datasets from databases and 
    3. To run queries in the same environment of the R and Python workspace 

    1. SQL for R 

    a. The DBI package can be used in the R package for setting up a connection to databases. The DBI package has options for connecting with popular databases. MariaDB, Postgres, SQLite, etc. are just a few of them.

    b. Importing tidyverse library gives us access to dplyr, ggplot and default datasets, which we can use to fetch default tables. Or you can even create your own data sets by using the CREATE table syntax and then use INSERT to add in values.

    c. DBI package can also be used to query the data by using dbGetQuery() function. You can do it simply by pasting your SQL code into the R function within a quoted string. This method is also referred to as pass-through SQL code. An example of this use case can be seen below: 

    Code Refactoring, Code Versioning and Documentation

    Additionally, you can also use dplyr, by writing your code in dplyr with R syntax. dplyr has a feature that will convert your R code into SQL queries.

    2. SQL for Python

    In this sub-section, we will be discussing how to manage databases and use SQL for Data Science with Python which can also lead to knowing that it is possible to use sequel (SQL) for data analytics. 

    a. SQLAlchemy is another popular python module that lets the user connect to all kinds of databases by providing a unified API. The process involves creation of an engine where you can specify the characteristics such as database name, username, password, host and port for that particular database. 

    b. Once the connection has been established to the preferred database, fetching all the tables within that database becomes a possibility or you can also go ahead and create tables on your own. SQLAlchemy also allows us to look at all the tables present in the database and hence helping you to fetch all the tables within it.

    c. Two steps are needed to be performed using SQL Alchemy for you to be able to write SQL queries in python. The first one is to include your query within text() and then utilize the conn.execute() function to execute your query in that environment

    d. A full example of a connection, fetching a table, and execution of a query can be found here, 

    Code Refactoring, Code Versioning and Documentation

    In the next sub-section, we will be looking at how SQL finds its application in Apache Hadoop and Spark frameworks. But first, a quick context about both of these frameworks. The Apache Software Foundation developed both Hadoop and Spark, two popular open-source frameworks for big data structures. Each framework has a rich ecosystem of open-source tools for handling, managing, and analyzing large amounts of data. 

    3. SQL with Hadoop 

    SQL implementation on the Hadoop platform, known as SQL on Hadoop, combines normal SQL-style structured data querying with the Hadoop data architecture. Big data and Hadoop are both relatively new platforms, and few professionals are experts in either. However, SQL on Hadoop makes it simpler to access the Hadoop framework and easier to integrate it into existing enterprise systems. 

    One of the SQLs on Hadoop implementation include the widely popular Spark SQL module, which we shall elaborate in the below section. 

    4. SQL for Spark 

    A Spark module for processing structured data in Spark SQL. It offers the programming abstraction called DataFrames and also functions as a distributed SQL query engine. The features also include powerful integration with the rest of the spark ecosystem such as integrating SQL query processing with ML.

    Prerequisite 

    SQL is a relatively easier topic to start learning, which means it will require minimum prerequisites, to begin with. But, we shall outline a few topics that are good to know before you begin your journey to learn SQL.

    In the above segment, we have showcased how to use SQL in R and Python to connect various popular databases. A good place to start with will be to understand the concepts of Relational Database Management Systems (RDBMS). You can explore the following concepts along with understanding what relational model is used for building relational databases:

    • Table attributes 
    • SQL constraints 
    • Data Integrity categories 
    • Data Normalization 

    These concepts will give us a clear picture of managing data with SQL and the process of creating tables. Another important concept that would make it easier for us to retrieve data in a fast manner is understanding how indexing works in SQL. Alongside, you can check KnowledgeHut data science course details. 

    Getting Started with SQL 

    You can use rely on this section to build your first table. 

    Step 1: Creating Tables 

    We shall be using the CREATE command over here and also have a look at its components. 

    Components of a SQL CREATE statement

    Components of a SQL CREATE statement

    Executing this above statement will create a new table in your environment. See, it’s that easy. Now go ahead and try out this query over here

    Step 2: Adding/Inserting Data to the Table 

    Now the next step is to go ahead and add data to the table that you have just created. This step can be accomplished by using the INSERT command. 

    Components of a SQL CREATE statement

    One should be careful while using the INSERT statement to insert the respective values into their correct positions w.r.t to their columns assigned during the CREATE statement. 

    Step 3: Retrieving Data 

    Now that we have created our first table, the next step will be to retrieve this data. This will be a straightforward process of querying the table with its name using the SELECT and FROM commands. 

    Retrieving Data

    We can retrieve data by using the SELECT and FROM commands 

    When a table consists of multiple rows, we can use the WHERE command along with SELECT to retrieve specific rows. The WHERE clause is generally used with a combination of various conditions for filtering the data while retrieving it from either created tables or retrieving tables from databases. 

    To look at the entire flow of the code at once, 

    Retrieving Data

    Adding Comments to SQL

    It is crucial to include comments in the SQL queries that you write. It improves the following aspects:

    • Readability of the query, as the purpose of the query shall be included with each code chunk created.
    • The query shall appear much cleaner which makes it more consumable for any other person trying to debug or use it.
    • Plus, as “extra marks for good handwriting” goes, other users will definitely be thanking us for maintaining the standards.

    We will be looking at two methods to add comments to our code:

    • Begin the comment with ‘/*’, then proceed with the content of the comment which can span multiple lines as well. The comment can be terminated with ‘*/’. This format of commenting code can also help in documenting which can be utilized in the future for reference. For example:

    Adding Comments to SQL

    • Another way to add comments to your query is by using ‘--’. The advantage of using this method is that it can be added along the lines of code as we can see in the example here: 
    • This method of adding comments to the code also helps one to document single lines of code which again are good documentation practices. 

    Adding Comments to SQL

    Commenting on your code becomes essentially necessary as you start writing more advanced queries. As codes start becoming more complex, adding these blocks will help you a great deal in comprehending your own code.

    Filtering, Sorting and Calculating Data with SQL

    As we had mentioned that being a data scientist you will spend almost 80% of your time understanding your data. Here we are going to introduce a few concepts that will help you approach this process: 

    1. Filtering Data and Sorting Data

    You might have already observed that we have used the WHERE clause in a few of our examples above. WHERE clause within SQL helps us filter our data from the database at a level that you will want to see your data. It basically filters records from a table that fulfil a specific condition specified by you.

    For instance, let us say from your Students (A table containing student records with their test scores) table you wanted to have a look at the student scores, but only want to focus on those who scored greater than 75. You also want to sort them in descending order to rank them, then your query will look like the following chunk:

    Adding Comments to SQL

    Adding Comments to SQL
    You can have multiple conditions included in the WHERE clause and can various operands to slice your data at a level you are seeking to fetch it. 

    The ORDER BY clause is simply used for sorting data in either ascending or descending order. The default setting for ORDER BY is sorting the values in ascending order. The ORDER BY clause must be included only after the filtering clause of WHERE as stated by the order of execution in SQL. 

    2. Aggregating/Calculating Data

    Another direction we can look into for SQL’s functions is the aggregation section. Using the same example above, you want to aggregate scores for a student across all the subjects. Use aggregate functions as shown below: 

    Adding Comments to SQL

    Aggregate functions are used with GROUP BY clause which will create groups of specified columns and apply the aggregation functions to them. 

    Here, we would strongly recommend you understand order of execution in SQL (which we shall introduce in the next section). Order of execution in SQL tells us exactly in what order will a query be performing its actions even though it’s written in a standard SQL format. Understanding this will enable us to write efficient queries.

    6 SQL Tricks Every Data Scientist Should Know 

    Order of Execution 

    6 SQL Tricks Every Data Scientist Should Know

    1. Master the “GROUP BY” Clause in SQL

    One of the most often used clauses in SQL is "GROUP BY". It can be used to aggregate data, and there are numerous ways to make it work effectively for you. The different ways you can think of using GROUP BY with one or multiple columns are: 

    A. Use ORDER BY for sorting your data post application of GROUP BY.

    B. HAVING clause is specifically used with GROUP BY to filter the aggregated/grouped data.

    These are a few tips you can begin with and dive deeper into using GROUP BY with more complex queries.

    2. Joins

    Often you might require analyzing data tables that might contain information in more than one table. It becomes a tedious task to look into both of these tables separately to come up with an analysis. To save your time with this, there exist multiple types of joins.

    Joins indicate how the SQL server should use the data from one table to select the rows from another table. The different types of joins that you use are:

    6 SQL Tricks Every Data Scientist Should Know

    Mastering Joins in SQL will help you deal with multiple tables at once post the application of joins. Although, one aspect you need to be extra careful about is the concepts of types of keys in SQL which are primary, secondary, foreign, composite, super, and alternate keys. Knowing them will help you in effectively using Join statements.

    3. Nested Queries or Sub-queries

    A Sub-query or a Nested query is a query within another SQL query and is embedded within the WHERE clause. Subqueries come in handy when the result that you are looking for requires more than a single query and each subquery provides a subset of the table involved in the query.

    4. Stored Procedures

    Python user-defined functions or UDFs are quite well known around and used extensively in the Python environment. SQL's equivalent of User Defined Functions is Stored Procedures. A stored procedure is basically a prepared SQL code that can be saved and used multiple times to perform the same action. Here is an example below:

    i. Here we show the syntax to create a stored procedure for simply retrieving all the data from CustomerTable.

    6 SQL Tricks Every Data Scientist Should Know

    ii. Secondly, how to call the stored procedure.

    6 SQL Tricks Every Data Scientist Should Know

    5. Working with Indexes and Null Values

    With the help of indexes, we can load the data easily onto the database. Indexes also help us with special lookup tables. So, including indexes can add benefits while querying the data.

    Null values can cause issues in the future if they are not dealt with as part of the data preparation step. We might face issues like not being able to use formulas, calculations and other analyses to get accurate results. So, it becomes necessary to understand null values and deal with them by applying business context.

    To create indexes which help us to retrieve data we can do the following:

    6 SQL Tricks Every Data Scientist Should Know

    Steps to Master SQL for Data Science [From Scratch]

    Moving on to the final element here, we would like to highlight how you can master SQL specifically for Data Science. Fundamentally, Data Science is all about understanding data and using it to create impactful insights through analysis and modelling using AI/ML that can drive decisions to bring about changes in behaviour. So where does SQL fit into this equation here?

    We know that Data Science pipeline consists of a crucial step to creating a dataset that will enable one to conduct the analysis, run ML models, and gather findings and insights. It is also called as the ‘Analytical Dataset’ or ADS. There have been multiple occasions where one might have wondered how easy it might have been to utilize SQL to create the analytical dataset and move it further.

    Keys in SQL 

    A Key is a single or combination of multiple fields in a table. Keys are an attribute of database management systems. This plays an important role to connect and create relationships between two or more tables Well, this is the definition, in simpler terms to communicate with different columns of different tables in databases we use keys and its property.

    Primarily, we have Seven types of keys as follow.

    1. Super key: It is a collection of single or more keys used to point out a certain type of information in a table.
    2. Candidate key: It is a collection of single or more columns that can point out a row uniquely in a table. 
    3. Primary key: It is a collection of single or multiple fields/columns of a table that uniquely identify a record in a database table. It cannot accept null, duplicate values. Only one Candidate Key can be Primary Key in table. 
    4. Alternate key: It is a key that can work as a primary key. However, a candidate key could not be used as a primary key.
    5. Composite/Compound key: It is a union of multiple fields/columns of a table. It can be a Candidate key or a Primary key. 
    6. Unique key: It is constraint which makes sure that all values in a column are different.
    7. Foreign Key: It is an attribute in a database table, which is the Primary key in a different table. It can accept multiple nulls and duplicate values.

    Keys in SQL

    SQL is a domain-specific language used for storing, manipulating, and reclaiming data in databases. SQL stands for Structured Query Language and works with many relational database systems like MySQL, Oracle and SQL servers. With the introduction of big data, QL plays important role in relational databases.

    Conclusion 

    In this article, we have gone through immense details on how you can master SQL for Data Science. One of the main reasons why SQL is still so prevalent is its integration with other languages such as Python, Spark, R and Hadoop.

    Given the extensive application in various enterprises, languages like Python, Spark constant evolving usage across multiple domains is helping SQL stay the #1 choice for most data-related interviews. A better way to learn the fundamentals of SQL for Data Science is with the support of KnowledgeHut data science with Python courses.

    Profile

    Sushil Deore

    Author

    A city dweller who loves to travel, interested in outdoor activity most on road trips also, A data science professional with a liking for deep learning.

    Share This Article
    Ready to Master the Skills that Drive Your Career?

    Avail your free 1:1 mentorship session.

    Select
    Your Message (Optional)

    Upcoming Data Science Batches & Dates

    NameDateFeeKnow more
    Course advisor icon
    Course Advisor
    Whatsapp/Chat icon