How to Use SQL for Data Science?

Read it in 11 Mins

Last updated on
29th Sep, 2022
Published
22nd Jun, 2022
Views
375
How to Use SQL for Data Science?

Well, data science is making its impact on every industry. You must have heard its applications and achievements. Many organizations are now actively using data science to solve business problems like sales, revenue improvements, identify suspicious transactions, task priority, etc. Have you ever thought about how these big organizations handle such big scale data?

The answer is database! 

To handle such data, organisations save data in different databases for procurement, sales, revenue, customer data, etc. to manage these databases we use SQL.

Why SQL? Why Not Python or Java?

SQL and Python/java are not in competition. Python and Java have achieved excel in different domains, but both are general-purpose languages. SQL is working with relational databases. 

What are Relational databases? 

A relational database is a Digital Database that is dependent on a relational model of data. 

Now What is Relational model? 

This model arranges data into tables (or relations) of rows & columns. To identify each data row, relational databases use a unique key. Normally, each table keeps one unit of information (a table) that contains customer data, address, previous customer purchases, etc., respectively.

Keys in SQL

A Key is a single or combination of multiple fields in a table. Keys are an attribute of database management systems. This plays an important role to connect and create relationships between two or more tables Well, this is the definition, in simpler terms to communicate with different columns of different tables in databases we use keys and its property. 

Primarily, we have Seven types of keys as follow. 

  • Super key: It is a collection of single or more keys used to point out a certain type of information in a table.
  • Candidate key: It is a collection of single or more columns that can point out a row uniquely in a table. 
  • Primary key: It is a collection of single or multiple fields/columns of a table that uniquely identify a record in a database table. It cannot accept null, duplicate values. Only one Candidate Key can be Primary Key in table. 
  • Alternate key: It is a key that can work as a primary key. However, a candidate key could not be used as a primary key.
  • Composite/Compound key: It is a union of multiple fields/columns of a table. It can be a Candidate key or a Primary key. 
  • Unique key: It is constraint which makes sure that all values in a column are different.
  • Foreign Key: It is an attribute in a database table, which is the Primary key in a different table. It can accept multiple nulls and duplicate values.

SQL is a domain-specific language used for storing, manipulating, and reclaiming data in databases. SQL stands for Structured Query Language and works with many relational database systems like MySQL, Oracle and SQL servers. With the introduction of big data, QL plays important role in relational databases.

Role of SQL in Data Science World

Data science is a multidisciplinary field that uses scientific techniques, processes, algorithms to draw out information from data. Sources of data grow with innovations like 5G, which will create a new standard in wireless technology in data collection and its application in different devices. Recent development like this increase potential of data science. Hence, building in-demand skills in Data Science is important. To equip yourself with these career-building blocks can be followed at Science Bootcamp Review.

Data can be noisy, structured, or unstructured. Here, we are concentrating on Structured data which is normally stored in databases. So, to work around the data we use SQL queries to get the desired data.

Let’s have a look why we need SQL?

  • The initial steps of data science project are to define objectives or asking correct questions to business problem. After finalising objectives, we acquire data which is useful for solving the problem. Hence a data scientist should know SQL commands. 
  • To gamble around big data platforms like Hadoop and spark which provide an add-on for SQL commands. 
  • Relational databases like Oracle, Microsoft SQL, MySQL need SQL to perform analytics functioning. 
  • SQL is used in the selection of data for operation, data preparation. 
  • To communicate with databases with scripting language like python or java, SQL is used with extensions. 

Now we will take a deep dive and look into the factors that can help you become a SQL expert. Besides the primary qualification, you must have proper education. On that part you need Industry oriented and globally acknowledge courses like Best Data Science Courses, which are helpful in this case.

Get to know more about Data Science Career path

SQL Statement Fundamentals

In this section we will focus on basic SQL syntax, which will be useful in almost all your future SQL queries.

A query is a request for details from a database using a merger of statements. In general, we will focus on the syntax for constructing a SQL query.

For syntax reference we will follow following figure: 

  • SELECT: It is the most common statement which allows retrieving information from the database table.[2] 

Example syntax for SELECT statement: 

1. SELECT RollNo, Name FROM Table1 
2. SELECT * FROM Table1 

In general, it is not a good practice to use an asterisk (*) in the SELECT statement if you don’t really need all columns. It will impatiently query everything, which increases congestion between the database server and the application, which can slow down the reclaiming of results. If you need certain columns, only query those columns. 

  • DISTINCT: Most of the time a table contains a column that has duplicate values and if you need list of unique/distinct values. The DISTINCT keyword is used to get only distinct values from a column.[2] 

The DISTINCT keyword applies on column, syntax as follow: 

3. SELECT DISTINCT column FROM table 
4. SELECT DISTINCT (Address) FROM Table2 

Code line 6 will give output like this-

  • COUNT: The COUNT attribute outputs the number of input rows equal to a condition of a query. We can apply COUNT on a certain column or can use asterisk (*), we will get the same results.[2] 
5. SELECT COUNT (Name) FROM Table1 

Code line 7 can simply return the number of rows in the table, it is same regardless of the column. 

6. SELECT COUNT (Name) FROM Table1 
7. SELECT COUNT (ID) FROM Table1 
8. SELECT COUNT * FROM Table1 

Code lines 8,9,10 returns the same thing, since original table had 6 rows. 

  • WHERE: It is one of the fundamental SQL statements. Along with SELECT it is used regularly. The WHERE statement helps you to apply condition on columns for the rows to be returned. [2] 

To apply the condition, we will first study comparison Operators & Logical operators. 

Logical Operators allow us to combine multiple comparison operations 

  • AND 
  • OR 
  • NOT 
9. SELECT COUNT (Address) FROM Table2 
10. WHERE Address = ‘Delhi’ 

Code lines 11, 12 will return following output:

11. SELECT COUNT (Address) FROM Table2 
12. WHERE Address = ‘Delhi’ AND EnrollNo = ‘AX105’ 

Code lines 13,14 will return following output: 

  • ORDER BY: It is used to sorting records based on column value, in ascending or descending order. [2] Basic syntax for ORDER BY, 
13. SELECT column_1, column_2 
14. FROM table 
15. ORDER BY column_1 ASC/DESC 
16. SELECT Address 
17. FROM Table2 
18. ORDER BY Address DESC

Code lines 18, 19, 20 will return following output: 

Note: We use ORDER BY towards the end of the query, as we want to do any selection or filtering first, before doing the sorting. ASC is used for ascending order and DESC is used for descending order, by default ORDER BY uses ASC. 

  • Limit: The LIMIT command deals with limiting number of rows returned for a query. To get idea of table columns and first few records LIMIT command is used. LIMIT also becomes handy when used with ORDER BY.[2] LIMIT goes at the very end of the query request, and it is the last command to be implemented. Basic syntax of LIMIT:
19. SELECT column_1, column_2 
20. FROM table 
21. ORDER BY column_1 DESC 
22. LIMIT (number) 
23. SELECT Address 
24. FROM Table2 
25. ORDER BY Address DESC
26. LIMIT 4 

Code lines 25, 26, 27, 28 will return following output: 

  • BETWEEN: The BETWEEN operator can be used to equal a value in a range of values. [2] Basic syntax for BETWEEN 
27. value BETWEEN low AND high 

The BETWEEN operator is same as, 

28. value >= low AND value <= high 
29. value BETWEEN low AND high 

We can also use BETWEEN with NOT logical operator: 

30. value NOT BETWEEN low AND high 

The NOT BETWEEN operator is the same as, 

31. value < low OR value > high 
32. value NOT BETWEEN low AND high 

The BETWEEN operator can also be used with time and dates.

Note: The requirement of date format to be the same as ISO 8601 standard format which is YYYY-MM-DD. 

33. date BETWEEN ‘2007-01-01’ AND ‘2007-02-01’ 
  • IN: In some problems we want to check for multiple possible values, [2] for example, if a names present in guest list? 

We can use the IN operator to define condition to see if a value in present in a column or many options. Basic syntax for IN, 

34. value IN (option1, option2, …, option_n) 
35. SELECT Address 
36. FROM Table2 
37. WHERE Address IN (‘Delhi’, ‘Noida’) 
  • LIKE and ILIKE: We have seen the direct comparison in Where statement using equal comparison operator. But what if we want to see equal general pattern in a string? [2] 

The Like operator helps us to perform matching operation with string data with the use of wildcard characters: 

  • Percent %: Matches any sequence of characters
  • Underscore _: Matches any single character 

For example, all email ending with ‘@gmail.com’ or names that begins with ‘S’. Basic Syntax for LIKE and ILIKE, 

38. WHERE name LIKE ‘A%’ 
39. WHERE name LIKE ‘%a’ 

Note: LIKE is case-sensitive and ILIKE is case-insensitive 

SQL friendship with Python

Most of the time after objective definition data scientists start to e selecting columns that are important for solving a business problem from an enterprise database. Here to connect SQL databases to applications defined in python we use various packages. These packages make simple for data selection and operation in databases using python and SQL combined.

Let’s check similar connection with MySQL database and python application.[3] 

40. import mysql.connector 
41. # Create the connection object 
42. myconn = mysql.connector.connect(host = "localhost", user = "root", passwd = "google", database = "mydb") 
43. # printing the connection object    
44. print(myconn) 
45. # Creating the cursor object 
46. Cur = myconn.cursor() 
47. print(Cur) 

Output:

<mysql.connector.connection.MySQLConnection object at Ox7ytSS42edd870> MySQLCursor: (Nothing executed yet)

Wait it's not done yet, if you want to learn more do follow KnowledgeHut Data Science Bootcamp Review. For in-depth and complete learning. 

Conclusion

In This article, we talk about general SQL introduction, fundamental commands used to create queries. This article is ideal for a beginner. In future articles, we will discuss advanced commands and cloud-based database connection, and query implementation.

Frequently Asked Questions (FAQs) 

Do we use all keys mentioned in this article?

Normally we only use Unique key, Primary key and foreign key. The constraints are applied while the functionality of keys is defined in the table. Like Primary keys must contain UNIQUE values and cannot contain NULL values. 

What is DBMS?

A Database Management Systems (Or DBMS) are software systems used to store, manipulate and run queries on data. A DBMS is bridge between end-user and database, allowing users to perform SETL operations. 

What is data lake?

A data lake is a repository of data stored on its natural/raw format. Data lakes are used to simplify data management, speed up analysis and improved security and governance. 

What is difference between data lake and warehouse?

Data lake and warehouse are used for storing big data. A data lake is storage for raw data, whose purpose is yet to conclude. A data warehouse is structured, filtered data designed considering a purpose. 

Profile

Sushil Deore

Author

A city dweller who loves to travel, interested in outdoor activity most on road trips also, A data science professional with a liking for deep learning.