A city dweller who loves to travel, interested in outdoor activity most on road trips also, A data science professional with a liking for deep learning.
Read it in 11 Mins
While learning a new subject, there have always been two major approaches that most of us take up. First is the Classic method of systematically learning all the concepts by going step-by-step from basic to advance levels. The other one is trying to learn everything one night before the interview or an exam. For Sequel data analytics, we may have seen many instances where our colleagues start rote learning to get a hang of the topics and concepts. This method might get you through your interview, but in the long run and in your daily work you will be back to struggling and going around googling the same stuff repeatedly to perform simple tasks.
The road to mastering SQL programming specifically for data science, which many employers still seek as the most in-demand skill, is the topic we will be discussing in this article. If we were to look at the skill trends across the past few years, we will find SQL was always a part of the top 10 programming languages. To polish your Data Science skills, you can opt for Data Science and Machine Learning online course.
SQL or Structured Query Language or ess-cue-ell or even SEQUEL, let us first try to understand what SQL is and how it works. Also, know that Sequel programming languages is a popular term.
Structured Query Language (SQL) is a standardized programming language that is used to manage relational databases and perform various operations on the data that is present on them.
With SQL you will be able to create, insert, alter, update and filter data to any database format that you will see fit or optimum for further usage. Here, you will find new terminologies to determine the applications of SQL based on the set of commands or statements used:
The front-end to the back-end connection is established through the Client/Server language, which is SQL and is extensively used in the Data Science industry.
Reiterating the above point of SQL for Data Science being the most in-demand skill and relevant even after four decades of its existence, tells us a lot about how it offers several key features and advantages over the alternatives. It can also be noted how almost all the Data Science job postings that we see today on different platforms list SQL as one of the primary skills. In another survey from the 2020 LinkedIn top 10 Startups from India list, 7 have SQL listed as one of their Most Common Skill.
The following points highlight the need for SQL in Data Science:
One of the most crucial aspects of being able to solve a problem in any domain/industry of Data Science is to understand the data to the fullest. In most companies, there is an “80/20” rule that is followed whereas it is seen that 80% of a data scientist’s valuable time is spent simply finding, organizing, and cleansing the data, and only 20 percent to performing actual analysis on the prepared data.
The process or steps involved in data understanding can be as follows:
A) Creating a Data Dictionary
This step simply involves noting down what each column is trying to tell us through the range of values it consists. We can even get our data dictionary validated by the clients.
B) Looking out for Data Discrepancies
Here, we can identify the columns with missing values, NAs and outliers. This helps us in cleaning the data as well.
C) Getting Statistics of Numerical Columns
Through simple functions, we can get the distribution of numerical columns in the tables. We can look at the mean, median, deviation and variation.
D) Using Visualisations
Sometimes it gets tricky to understand a few columns as they might have a wide range spread across large values. Graphs and visualisations come in handy at this point of time.
When we mention that we need to conduct multiple sessions with the business/stakeholder to understand the data, there is also another aspect that we need to consider i.e. the business understanding. The understanding of data is incomplete without the business aspect, as we will be missing out on crucial contexts that we can acquire through the business.
Here, let us look into an example of a business problem to find the annual revenue, profits and losses and the year-on-year growth of stores across the country.
The problem, even though seemed straightforward, having interactions with the business/stakeholders will give us directionality as to what products to focus on, what stores to put more weightage on, and many such details which enable a data scientist to shape their analysis. The two stages above help us to fill in the blanks within the queries that we will want to work on as we progress with data and business understanding.
For instance, understanding the core transactions that happen in a business such as the telecom industry can help us interpret the data in a more holistic manner as we can understand factors such as average talk time, churned customers, and loyal customers better. To begin your journey in Data Science, enroll in KnowledgeHut Data Science and Machine Learning online course.
We can define profiling data as the process of analysing, examining, reviewing, and summarizing data sets to gain insight into the quality of data.
Once the data profiling step of managing the data with SQL with all its quality checks and ensuring the data is at its highest standard, the next step is to query the data using SELECT and FROM.
Upon completing the above three sections, you will be starting with SELECT. The SELECT command in SQL is an easier place to start your analysis. As the data would be new to look at, one can simply begin by starting with a single table first. Once you start getting comfortable with the analysis, more data and tables can be added to it and you can go back from there.
This leads to the next step of looking into the testing and troubleshooting of the queries that we create.
This step can also be called the QC step or the Quality Check. Herein, you will be checking the number of records your query will be returning. If the query included all the correct filters to get all the required data points to calculate any statistics, this step becomes crucial as getting results is a simpler task, but making sure it is accurate is of more importance.
The next stage is to be able to resolve any queries when we see any discrepancies in the outcomes or the queries themselves. By beginning to dissect the query step by step to identify the source of the issue, we can keep the procedure straightforward. This technique is also useful in that it will enable users to create subsequent queries that are more effective.
We all know the frustrations of having to decode a bug from a badly formatted code. It wastes an ample amount of time to debug it and even then, we are still left with badly written code.
A bit of digging into this aspect reveals that most programmers want code that simply works and also wants it with a quick turnaround, which makes formatting the last priority for any one of us.
Code Refactoring involves changing the factoring of already-written computer code without altering its external behaviour. Refactoring aims to maintain the functioning of the software while enhancing its design, structure, and/or implementation.
An effective method to avoid that messy code is to maintain two versions of your code, one which you keep to yourself for figuring out the code and the other version which is production ready for anyone else to take it up and also understand it easily. One adequate method to follow in SQL is to capitalize the commands in your statement and the rest of it can follow the lower case. Like the example shown below.
The next key suggestion will also be to comment on the codes that are written, this will make the code easily readable as the objective is already stated before the code. This can be seen in the example below. This will also ensure documentability on the code which can always be used for later purposes and will also make it easier for transition.
This forms the last step of the process and one of the most tedious parts of it as well. The entire effort that you put in until now will basically be reviewed from the top to bottom here. This will ensure you have captured all the data understanding and business rules effectively, conformed with the highest quality of data, refactoring, maintaining versions of the code and keeping it well documented will be crucial.
Until now we have understood how SQL can be used to effectively understand the data by using queries to define and manipulate the data. But what if we were to tell you that SQL can be used within various platforms as well such as R, Python, Hadoop and Spark.
Also, we need to be wary of the fact that SQL queries within R and python can only tell us about what has happened in the past and cannot predict future. So, their uses in these platforms are only limited to fetching, querying and understanding the data.
The general procedure we are going to look at for using SQL within R and Python is to be able to:
a. The DBI package can be used in the R package for setting up a connection to databases. The DBI package has options for connecting with popular databases. MariaDB, Postgres, SQLite, etc. are just a few of them.
b. Importing tidyverse library gives us access to dplyr, ggplot and default datasets, which we can use to fetch default tables. Or you can even create your own data sets by using the CREATE table syntax and then use INSERT to add in values.
c. DBI package can also be used to query the data by using dbGetQuery() function. You can do it simply by pasting your SQL code into the R function within a quoted string. This method is also referred to as pass-through SQL code. An example of this use case can be seen below:
Additionally, you can also use dplyr, by writing your code in dplyr with R syntax. dplyr has a feature that will convert your R code into SQL queries.
In this sub-section, we will be discussing how to manage databases and use SQL for Data Science with Python which can also lead to knowing that it is possible to use sequel (SQL) for data analytics.
a. SQLAlchemy is another popular python module that lets the user connect to all kinds of databases by providing a unified API. The process involves creation of an engine where you can specify the characteristics such as database name, username, password, host and port for that particular database.
b. Once the connection has been established to the preferred database, fetching all the tables within that database becomes a possibility or you can also go ahead and create tables on your own. SQLAlchemy also allows us to look at all the tables present in the database and hence helping you to fetch all the tables within it.
c. Two steps are needed to be performed using SQL Alchemy for you to be able to write SQL queries in python. The first one is to include your query within text() and then utilize the conn.execute() function to execute your query in that environment
d. A full example of a connection, fetching a table, and execution of a query can be found here,
In the next sub-section, we will be looking at how SQL finds its application in Apache Hadoop and Spark frameworks. But first, a quick context about both of these frameworks. The Apache Software Foundation developed both Hadoop and Spark, two popular open-source frameworks for big data structures. Each framework has a rich ecosystem of open-source tools for handling, managing, and analyzing large amounts of data.
SQL implementation on the Hadoop platform, known as SQL on Hadoop, combines normal SQL-style structured data querying with the Hadoop data architecture. Big data and Hadoop are both relatively new platforms, and few professionals are experts in either. However, SQL on Hadoop makes it simpler to access the Hadoop framework and easier to integrate it into existing enterprise systems.
One of the SQLs on Hadoop implementation include the widely popular Spark SQL module, which we shall elaborate in the below section.
A Spark module for processing structured data in Spark SQL. It offers the programming abstraction called DataFrames and also functions as a distributed SQL query engine. The features also include powerful integration with the rest of the spark ecosystem such as integrating SQL query processing with ML.
SQL is a relatively easier topic to start learning, which means it will require minimum prerequisites, to begin with. But, we shall outline a few topics that are good to know before you begin your journey to learn SQL.
In the above segment, we have showcased how to use SQL in R and Python to connect various popular databases. A good place to start with will be to understand the concepts of Relational Database Management Systems (RDBMS). You can explore the following concepts along with understanding what relational model is used for building relational databases:
These concepts will give us a clear picture of managing data with SQL and the process of creating tables. Another important concept that would make it easier for us to retrieve data in a fast manner is understanding how indexing works in SQL. Alongside, you can check KnowledgeHut data science course details.
You can use rely on this section to build your first table.
We shall be using the CREATE command over here and also have a look at its components.
Components of a SQL CREATE statement
Executing this above statement will create a new table in your environment. See, it’s that easy. Now go ahead and try out this query over here.
Now the next step is to go ahead and add data to the table that you have just created. This step can be accomplished by using the INSERT command.
One should be careful while using the INSERT statement to insert the respective values into their correct positions w.r.t to their columns assigned during the CREATE statement.
Now that we have created our first table, the next step will be to retrieve this data. This will be a straightforward process of querying the table with its name using the SELECT and FROM commands.
We can retrieve data by using the SELECT and FROM commands
When a table consists of multiple rows, we can use the WHERE command along with SELECT to retrieve specific rows. The WHERE clause is generally used with a combination of various conditions for filtering the data while retrieving it from either created tables or retrieving tables from databases.
To look at the entire flow of the code at once,
It is crucial to include comments in the SQL queries that you write. It improves the following aspects:
We will be looking at two methods to add comments to our code:
Commenting on your code becomes essentially necessary as you start writing more advanced queries. As codes start becoming more complex, adding these blocks will help you a great deal in comprehending your own code.
As we had mentioned that being a data scientist you will spend almost 80% of your time understanding your data. Here we are going to introduce a few concepts that will help you approach this process:
You might have already observed that we have used the WHERE clause in a few of our examples above. WHERE clause within SQL helps us filter our data from the database at a level that you will want to see your data. It basically filters records from a table that fulfil a specific condition specified by you.
For instance, let us say from your Students (A table containing student records with their test scores) table you wanted to have a look at the student scores, but only want to focus on those who scored greater than 75. You also want to sort them in descending order to rank them, then your query will look like the following chunk:
The ORDER BY clause is simply used for sorting data in either ascending or descending order. The default setting for ORDER BY is sorting the values in ascending order. The ORDER BY clause must be included only after the filtering clause of WHERE as stated by the order of execution in SQL.
Another direction we can look into for SQL’s functions is the aggregation section. Using the same example above, you want to aggregate scores for a student across all the subjects. Use aggregate functions as shown below:
Aggregate functions are used with GROUP BY clause which will create groups of specified columns and apply the aggregation functions to them.
Here, we would strongly recommend you understand order of execution in SQL (which we shall introduce in the next section). Order of execution in SQL tells us exactly in what order will a query be performing its actions even though it’s written in a standard SQL format. Understanding this will enable us to write efficient queries.
Order of Execution
One of the most often used clauses in SQL is "GROUP BY". It can be used to aggregate data, and there are numerous ways to make it work effectively for you. The different ways you can think of using GROUP BY with one or multiple columns are:
A. Use ORDER BY for sorting your data post application of GROUP BY
B. HAVING clause is specifically used with GROUP BY to filter the aggregated/grouped data
These are a few tips you can begin with and dive deeper into using GROUP BY with more complex queries.
Often you might require to analyze data tables that might contain information in more than one table. It becomes a tedious task to look into both of these tables separately to come up with an analysis. To save your time with this, there exist multiple types of joins.
Joins indicate how the SQL server should use the data from one table to select the rows from another table. The different types of joins that you use are:
Mastering Joins in SQL will help you deal with multiple tables at once post the application of joins. Although, one aspect you need to be extra careful about is the concepts of types of keys in SQL which are primary, secondary, foreign, composite, super, and alternate keys. Knowing them will help you in effectively using Join statements.
A Sub-query or a Nested query is a query within another SQL query and is embedded within the WHERE clause. Subqueries come in handy when the result that you are looking for requires more than a single query and each subquery provides a subset of the table involved in the query.
Python user-defined functions or UDFs are quite well known around and used extensively in the Python environment. SQL's equivalent of User Defined Functions is Stored Procedures. A stored procedure is basically a prepared SQL code that can be saved and used multiple times to perform the same action. Here is an example below:
i. Here we show the syntax to create a stored procedure for simply retrieving all the data from CustomerTable
ii. Secondly, how to call the stored procedure
With the help of indexes, we can load the data easily onto the database. Indexes also help us with special lookup tables. So, including indexes can add benefits while querying the data.
Null values can cause issues in the future if they are not dealt with as part of the data preparation step. We might face issues like not being able to use formulas, calculations and other analyses to get accurate results. So, it becomes necessary to understand null values and deal with them by applying business context.
To create indexes which help us to retrieve data we can do the following:
Moving on to the final element here, we would like to highlight how you can master SQL specifically for Data Science. Fundamentally, Data Science is all about understanding data and using it to create impactful insights through analysis and modelling using AI/ML that can drive decisions to bring about changes in behaviour. So where does SQL fit into this equation here?
We know that Data Science pipeline consists of a crucial step to creating a dataset that will enable one to conduct the analysis, run ML models, and gather findings and insights. It is also called as the ‘Analytical Dataset’ or ADS. There have been multiple occasions where one might have wondered how easy it might have been to utilize SQL to create the analytical dataset and move it further.
A Key is a single or combination of multiple fields in a table. Keys are an attribute of database management systems. This plays an important role to connect and create relationships between two or more tables Well, this is the definition, in simpler terms to communicate with different columns of different tables in databases we use keys and its property.
Primarily, we have Seven types of keys as follow.
SQL is a domain-specific language used for storing, manipulating, and reclaiming data in databases. SQL stands for Structured Query Language and works with many relational database systems like MySQL, Oracle and SQL servers. With the introduction of big data, QL plays important role in relational databases.
In this article, we have gone through immense details on how you can master SQL for Data Science. One of the main reasons why SQL is still so prevalent is its integration with other languages such as Python, Spark, R and Hadoop.
Given the extensive application in various enterprises, languages like Python, Spark constant evolving usage across multiple domains is helping SQL stay the #1 choice for most data-related interviews. A better way to learn the fundamentals of SQL for Data Science is with the support of KnowledgeHut data science with Python courses.
Avail your free 1:1 mentorship session.