How to Master Excel for Data Science

Last updated on
14th Mar, 2022
Published
10th Mar, 2022

Excel is one of the most widely used tools for working with data. It is also one of the oldest, having been in use for decades for data-related tasks. Even though several newer tools are available, Excel still matters in data science: it covers the basic operations and helps beginners draw insights from their data.

Excel presents data as a table or a group of tables that you can load and work with. It is generally called a spreadsheet, and it is valuable for data pre-processing. Excel spreadsheets can hold a sizeable amount of data, give reliable results, and are an affordable tool for data analytics.

Many industries and businesses rely on Excel skills because spreadsheets are a practical way to bring out insights from data. Marketing data, weather patterns, accounting information and many other real-world datasets are analyzed in Excel. If you are interested in learning data science with programming, you could check out a Data Science with Python Course and understand the real-world applications of Data Science.

With an Excel spreadsheet, we can work with a moderate amount of data to gain insights. In real-world scenarios where the data runs to gigabytes or terabytes, however, Excel is not a suitable choice; a single worksheet holds at most 1,048,576 rows. In that case, we move on to Big Data tools, Hadoop, or the cloud, where large volumes of data are handled and the same kinds of operations are performed as in Excel.

How Excel is useful in Data Science

Excel spreadsheets help developers build a foundational understanding of data, because working in them teaches the analytical approach to gaining insights. To better understand data science and get real exposure to a data science platform, the Data Science with Python certification course provided by KnowledgeHut helps learners see how data science is applied in the real world. The use of Excel is limited to entry-scale data products, where it can provide basic visualizations and accurate results on the given data.

If you want to become a Data Analyst, who needs to handle a large amount of data and build reports from it, an Excel spreadsheet will help you understand the basic insights in the data. Excel is a tool for analyzing the given data; it is not the ultimate solution provider. Doing data science in Excel lets us create dashboards, generate visualizations, summarize and sort the data, segment complex data into simpler parts, and so on.

Excel helps with data science problems in many industries, such as automotive, health care, retail and finance. It can support the analysis behind applications that range from simple ones, such as recommendations and fraud detection, to complex ones, such as self-driving cars. It is widely regarded as one of the most useful tools in data science.

Best Editor for 2D Data

2D data is data arranged in rows and columns. It is tabulated with a heading for each column and with the values that describe the problem we want to solve. Excel is considered the best editor for 2D data because it supports many operations, such as adding, deleting, editing, formatting, colouring and sharing data. Google Sheets works like Excel but runs online, and it replicates most of the functionality of the offline Excel application.

Fig: - Excel Spreadsheet

**Excel is a data computation tool. It is both basic and powerful, works with formulas, and doesn’t require any coding to perform operations.**

Excel is considered the best editor for 2D data in data science because of the many operations it supports. It is useful for analyzing and visualizing data and for generating dashboards, and it offers features such as filtering, sorting, ranges, charts, conditional formatting, date and time functions, pivot tables, and so on.

Platform for Advanced Analytics 

As we know, Excel is capable of data analysis. Data analysis is the process of collecting, modelling, analyzing and exploring data so that better decisions can be made. It is the basic step carried out in any data science application.

Excel helps to solve many real-world data science problems. Microsoft positions Excel as a platform for advanced analytics, and its Analysis ToolPak add-in provides tools for statistical operations on data. Linear regression, which we use for regression problems, is available directly through the ToolPak’s Regression tool. Logistic regression, which we use for classification problems, is not built in, but it can be fitted in Excel with extra work, for example with the Solver add-in.
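To make the linear regression mentioned above concrete, here is a minimal Python sketch of ordinary least squares for a single predictor, which is the same kind of calculation Excel performs for a straight-line fit. The function name `fit_line` and the sample points are assumptions made for illustration; they do not come from the article's dataset.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up points that lie exactly on y = 2x + 1, so the fit is easy to check.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

On real data the points will not fall exactly on a line, and the fitted slope and intercept describe the best straight-line approximation.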

Some Important Keyboard Shortcuts for Excel Data Analysis  

  1. Ctrl + Pg Up/Pg Dn : To switch between worksheets
  2. Alt + A + S + S : To sort the data
  3. Tab : To go to the next cell
  4. Shift + Spacebar : To select the entire row
  5. Ctrl + Spacebar : To select the entire column
  6. Alt + Shift + Right Arrow : To group rows or columns
  7. F11 : To create a chart on a new sheet
  8. Alt + F1 : To create a chart on the same sheet
  9. Ctrl + H : To find and replace
  10. F2 : To edit a cell

Scripting in Excel 

To perform operations on data in Excel, we use formulas that begin with an equals sign (=formula). The same analysis and visualization tasks can also be carried out in programming languages such as Python or R.

Dataset for Data Analysis  

Let us look at an example: we take a small dataset and perform data analysis operations on it using a scripting language and an Excel spreadsheet. Most datasets for data science problems are available in repositories like Kaggle, GitHub, etc.; here we use a small dataset to demonstrate various data analysis functions.

I have put together a small Excel dataset that lists some of the courses a training company offers to students, professionals and others. I tabulated this data from the course list on the training company’s main website. It has three fields for each career path: “Career Path”, “No of Courses” and “No of Students Enrolled”.

Fig: - Course list from a training company

** Every time you work with a dataset, check whether it is clean. Does it contain null or missing values? Perform the necessary validation and data cleaning steps so that the dataset is in the proper format and ready for analysis. **

The training company’s courses dataset that we are using is clean and validated. It contains no null or missing values, so we do not need to perform any data cleaning or validation steps on it. We will now perform some basic Excel operations on this dataset.
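For a dataset that has not been checked yet, the null-value check described above could be sketched in Python roughly as follows. The rows are hypothetical: only 77 (Agile Management courses) and 30065 (Data Science enrolments) appear in this article, every other number is a placeholder, and the third row is given a missing value on purpose so the check has something to find. The function name `count_missing` is also just an illustration.

```python
# Hypothetical rows mirroring the three columns of the article's dataset.
# Only 77 and 30065 come from the article; the other numbers are placeholders,
# and the None in the last row is deliberate, to demonstrate the check.
ROWS = [
    {"Career Path": "Agile Management", "No of Courses": 77, "Students Enrolled": 12000},
    {"Career Path": "Data Science", "No of Courses": 40, "Students Enrolled": 30065},
    {"Career Path": "Web Development", "No of Courses": 25, "Students Enrolled": None},
]


def count_missing(rows):
    """Return {column: number of missing cells}; None and blank strings count as missing."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            is_missing = value is None or (isinstance(value, str) and value.strip() == "")
            counts[column] = counts.get(column, 0) + int(is_missing)
    return counts


print(count_missing(ROWS))
```

A result of zero for every column would correspond to the clean dataset used in the rest of this article.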

Excel Tricks for Data Science 

We will look at some Excel tricks for working with the given dataset. This blog helps readers improve their data analysis skills in the field of data science and understand some basic insights from data.

  1. VLOOKUP( ):- VLOOKUP stands for “Vertical Lookup”. It searches for a value in the first column of a table array and returns the value from another column in the same row.

This function takes 4 arguments:

  • The lookup value
  • The range of data to search (the table array), which also contains the value to return
  • The column number, within that range, of the value to return
  • FALSE or 0 for an exact match with the given value; TRUE or 1 for an approximate match

Syntax of the VLOOKUP( ) function:-  

** VLOOKUP([value to look up], [range of data / table array], [column number], [true / false]) **

Let us apply this function to the training company’s dataset above to find the number of students enrolled in the Data Science course.

  • For the above dataset, the Excel formula is =VLOOKUP(A4,A3:C13,3,0)

Where the parameters are explained as follows: -   

  • A4 – the lookup value, the cell containing the course name Data Science
  • A3:C13 – the table array
  • 3 – the column index of the value to return, here the third column (Students Enrolled)
  • 0 – requests an exact match; FALSE can be used instead

Fig: - Application of VLOOKUP function on training company’s Dataset.

When we apply the VLOOKUP function to the above dataset to find the number of students enrolled in the “DATA SCIENCE” course, we obtain 30065, which is the exact match for that course in the dataset.
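The exact-match lookup can also be sketched in a scripting language. The Python below is a simplified analogue, not a reimplementation of Excel: the helper name `vlookup` and the table are assumptions, and the comparison here is case-sensitive whereas Excel's VLOOKUP ignores case. Only the values 77 and 30065 come from the article; the other numbers are placeholders.

```python
def vlookup(lookup_value, table, col_index):
    """Exact-match lookup: find the first row whose first column equals
    lookup_value and return the value in column col_index (1-based)."""
    for row in table:
        if row[0] == lookup_value:
            return row[col_index - 1]
    return None  # Excel would display #N/A when nothing matches


# Columns: Career Path, No of Courses, Students Enrolled.
TABLE = [
    ("Agile Management", 77, 12000),  # 12000 is a placeholder
    ("Data Science", 40, 30065),      # 40 is a placeholder
]

print(vlookup("Data Science", TABLE, 3))
```

As in the Excel formula, the third argument picks the column to return, counted from the left edge of the table.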

  2. CONCATENATE( ): - CONCATENATE means to join or combine. This function combines the data in two or more cells, whether text or numbers, into a single text value. We can add a separator between the combined values by passing it in double quotes, such as “ ” or “-”.

Syntax of the CONCATENATE( ) function: -   

CONCATENATE(text1, text2, text3, ...)

Let us apply this CONCATENATE function on the above training company’s dataset to see its functionality.  

Fig: - Application of Concatenate Function on the training company dataset

  • For the above dataset, the Excel formula applied is =CONCATENATE(A3,"  -  ",B3)

Where the parameters are explained as follows: -   

  • A3 – selects the Career Path value, Agile Management
  • “  -  ” – the separator placed between the two values
  • B3 – selects the No of Courses value, which is 77

When the CONCATENATE function is applied to the dataset, the values from the Career Path and No of Courses columns are joined with the “  -  ” string literal between them, which makes the combined value easier to read.
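A rough Python equivalent of this join is shown below. The helper name `concatenate` is an assumption for illustration; it converts each value to text first, which mirrors how the number 77 ends up inside the combined text.

```python
def concatenate(*parts):
    """Join any number of values into one string, converting numbers to text."""
    return "".join(str(part) for part in parts)


# "Agile Management" and 77 are the A3 and B3 values quoted in the article.
combined = concatenate("Agile Management", "  -  ", 77)
print(combined)  # Agile Management  -  77
```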

  3. LEN( ): - The LEN function finds the length of the text in a cell. It counts the number of characters in that cell.

Syntax of LEN( ) function: - LEN(cell)   

Let us apply this LEN function on the A1 cell of the above Dataset, to see its functionality  

Fig: - Application of LEN Function on the A1 Cell

  • For the above dataset, the Excel formula is =LEN(A1)

Where the parameter A1 is the cell.

When the LEN function is applied to cell A1, it counts all the characters in the sentence, including spaces.
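The same character count can be reproduced with Python's built-in `len()`. The string below is a hypothetical stand-in, since the article does not quote the text of cell A1.

```python
# Hypothetical cell content; the article does not show the text of A1.
cell_a1 = "Data Science"

# len() counts every character, including the space between the two words.
length = len(cell_a1)
print(length)  # 12
```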

  4. UPPER( ), LOWER( ), PROPER( ): - These three functions convert text to the corresponding case.

  • UPPER converts the text in a cell to UPPERCASE
  • LOWER converts the text in a cell to lowercase
  • PROPER converts the text in a cell to Proper Case, capitalizing the first letter of each word (this is title case, not sentence case)

Syntax of the 3 functions: UPPER(cell) / LOWER(cell) / PROPER(cell)  

Let us apply these 3 functions on the cells A3:A13 of the above training company Dataset to see its functionality  

Fig: - Application of UPPER, LOWER, PROPER functions on cells A3:A13

  • For the above functionalities, the Excel formulas are =UPPER(A3), =LOWER(A3), =PROPER(A3)

Where the parameter is the A3 cell.  

  • When we apply UPPER to cell A3, Agile Management becomes AGILE MANAGEMENT.
  • When we apply LOWER to cell A3, Agile Management becomes agile management.
  • When we apply PROPER to cell A3, Agile Management stays Agile Management, because each word already starts with a capital letter.
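In Python, the closest counterparts are the string methods `upper()`, `lower()` and `title()`. This is a small sketch; the mixed-case input is a made-up variant of the A3 value, chosen so that every conversion visibly changes the text, and `title()` may differ from PROPER on unusual inputs.

```python
# Hypothetical cell text with mixed casing, chosen so that each call changes it.
cell_text = "agile MANAGEMENT"

upper_text = cell_text.upper()    # like UPPER(cell)
lower_text = cell_text.lower()    # like LOWER(cell)
proper_text = cell_text.title()   # like PROPER(cell): capital first letter per word

print(upper_text, lower_text, proper_text, sep=" | ")
```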
  5. TRIM( ): - The TRIM function removes extra spaces from the text in a cell: it strips leading and trailing spaces and reduces repeated spaces between words to a single space. When we deal with text data in a data science problem, we apply this function to remove the extra spaces from the cell data.

Syntax of TRIM function: - TRIM(cell)

Let us apply this TRIM function on the cells from A3:A13 of the above Dataset to see its functionality  

Fig: - Application of TRIM function on the cells A3:A13 of the dataset

For the above functionality, the Excel formula is =TRIM(A3)

Where the parameter is the A3 cell.

In the above picture, applying TRIM makes no visible difference between A3:A13 and E3:E13, because the text in the dataset contains no extra spaces.
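Because the dataset has no extra spaces, the effect of TRIM is easier to see on a deliberately messy string. The Python sketch below uses a hypothetical helper named `trim` and a made-up input; like Excel's TRIM, it only deals with the ordinary space character.

```python
def trim(text):
    """Strip leading/trailing spaces and collapse runs of spaces to one space."""
    # Splitting on a single space yields empty strings for extra spaces;
    # dropping those and re-joining leaves exactly one space between words.
    return " ".join(part for part in text.split(" ") if part)


messy = "  Agile   Management "   # made-up input with extra spaces
print(repr(trim(messy)))
```

Text that is already clean, such as the values in A3:A13, passes through unchanged.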

  6. IF( ): The IF function is used for decision making and is one of the most used functions in Excel. It evaluates a condition and has two possible results: if the condition is true it returns the true output, and if the condition is false it returns the false output.

Syntax of the IF( ) function: - IF(condition, True Statement, False Statement)  

Let us apply this IF function on the above dataset to see its functionality  

Fig: - Application of IF Function on the dataset.   

For the above dataset, the Excel formula applied is =IF(C3>10000,"Valuable Course","Good Course")

where the parameters are explained as follows: -  

  • C3>10000 – the condition in the IF function; it checks whether the number of students enrolled in a course is greater than 10000
  • Valuable Course – the true statement; if the condition is satisfied, Valuable Course is returned as output
  • Good Course – the false statement; if the condition is not satisfied, Good Course is returned as output

In the given dataset, the IF condition is applied to the number of students enrolled in each course, and the result is tabulated in the picture above. If the number of students enrolled in a course is greater than 10,000, the true statement is executed and “Valuable Course” is given as output. Otherwise (10,000 or fewer), the false statement is executed and “Good Course” is given as output.
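The same decision rule can be written as a conditional expression in Python. This is a sketch using the 10,000 threshold from the explanation above; the function name `label_course` is an assumption, and 30065 is the Data Science enrolment figure quoted earlier.

```python
def label_course(students_enrolled, threshold=10000):
    """Mirror IF(cell > threshold, "Valuable Course", "Good Course")."""
    return "Valuable Course" if students_enrolled > threshold else "Good Course"


print(label_course(30065))   # enrolment above the threshold
print(label_course(10000))   # exactly at the threshold, so not greater than it
```

Note that a course with exactly 10,000 students falls into the false branch, because the condition tests for strictly greater than.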

  7. PIVOT TABLE: - When we work with a large amount of data, we need to answer questions like “What is the total?”, “What is the average?” and so on.

A pivot table is a summary table that applies operations like Average, Count, Sum and more to the columns we select from the dataset. It helps us make decisions by converting the raw data table into a summary from which we can draw inferences.

Steps to Create a Pivot Table: - 

Step 1:- Open an Excel sheet that has data with column headings and values. Click on a data cell, click the Insert tab, and choose the Pivot Table option. A dialog box appears for creating the pivot table. In it, select the option named “Select a table or range” and enter the data range, including the column headers. Then choose where to place the pivot table, either in the same worksheet or in a new one, and click OK. Placing the pivot table in a new worksheet is recommended because it is easier to read.

Let us generate a pivot table for the dataset. After applying this first step, the result is as follows: -

Fig: - Applying Step 1 for generation of Pivot Table in the dataset.

Step 2: - After clicking the OK button in the above picture, you are taken to a new sheet, where you are asked to select the columns for the pivot table. A pivot table is best suited to operations on numerical data. After applying step 2 to the training company’s dataset, the output is as follows:-

Fig: - Applying Step 2 on the training company’s dataset.

When we tick the checkboxes on the right side of the above picture, the resulting pivot table is displayed on the left side of the new worksheet, with a Grand Total row at the end.

Fig: - Pivot Table is generated for the training company’s dataset.

When we check the boxes for the Career Path, No of Courses and Students Enrolled fields, we get the sum of the values for the Courses and Students Enrolled fields. The Career Path field is used for the row labels, so the career paths are listed in alphabetical order. In the pivot table, we can also apply other functions to the numerical data, such as min, max, average and count.
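What the pivot table computes here, a sum per career path plus a grand total, can be sketched in plain Python. The helper name `pivot_sum` and the rows are assumptions for illustration: only 77 and 30065 come from the article, and the remaining numbers are placeholders.

```python
from collections import defaultdict


def pivot_sum(rows, index, values):
    """Sum each column in `values` per distinct `index` value, plus a grand total."""
    totals = defaultdict(lambda: dict.fromkeys(values, 0))
    for row in rows:
        for column in values:
            totals[row[index]][column] += row[column]
    # Row labels are emitted in alphabetical order, like the pivot table above.
    table = {label: totals[label] for label in sorted(totals)}
    table["Grand Total"] = {c: sum(row[c] for row in rows) for c in values}
    return table


# Placeholder data except for 77 and 30065, which appear in the article.
ROWS = [
    {"Career Path": "Data Science", "No of Courses": 40, "Students Enrolled": 30065},
    {"Career Path": "Agile Management", "No of Courses": 77, "Students Enrolled": 12000},
]

for label, sums in pivot_sum(ROWS, "Career Path", ["No of Courses", "Students Enrolled"]).items():
    print(label, sums)
```

Swapping `sum` for another aggregation such as `min`, `max` or an average would give the other summary functions mentioned above.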

  8. Creating Charts: - Visualization is one of the most important parts of working with data. Sometimes we cannot understand the whole dataset given to us, whether because of its size, its complexity or other factors, so we visualize the data to draw better insights from it. As the saying goes, “a picture is worth a thousand words”: when we visualize things, we understand them much better. We can create charts in Excel using the F11 key or the shortcut ALT + F1.

With ALT + F1, the chart is generated in the same worksheet that we are working on. With F11, the chart is generated on a separate sheet, which is often easier to read. Charts represent numerical data as diagrams such as bar charts, histograms etc., depending on the type of dataset we work on.

Let us generate a chart for the training company’s dataset and see the result as follows.

Fig: - Chart drawn between Career Path and No of Courses from the training company’s dataset.

Fig: - Chart Diagram is drawn between Career Path and Students Enrolled from a training company’s dataset.

  9. Removing Duplicates from Dataset: - When we work with a large amount of data, we might face the problem of duplicate records when creating a data science model. Excel has a built-in feature for removing duplicate values from the data. We navigate to the Data tab and click the Remove Duplicates icon; a dialog box appears listing the column headings in the dataset. After we check the required columns and confirm, the duplicate rows are removed from the original dataset and we have clean data to work on.

Let us apply the Remove Duplicates option to the training company’s dataset and see the result as follows: -

Fig: - A dialog box asking which columns of the training company’s dataset to check for duplicate values.

Fig: - After checking the required column boxes, we see the result reporting the duplicate values in the training company’s dataset.
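The behaviour of Remove Duplicates, keeping the first occurrence of each combination of the checked columns, can be sketched in Python as follows. The helper name `remove_duplicates` is an assumption, and the rows are made up: the repeated row is added on purpose to give the function something to remove.

```python
def remove_duplicates(rows, columns):
    """Keep the first row for each distinct combination of the chosen columns."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[column] for column in columns)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Made-up rows; the second "Data Science" row is a deliberate duplicate.
ROWS = [
    {"Career Path": "Data Science", "No of Courses": 40},
    {"Career Path": "Agile Management", "No of Courses": 77},
    {"Career Path": "Data Science", "No of Courses": 40},
]

cleaned = remove_duplicates(ROWS, ["Career Path", "No of Courses"])
print(len(ROWS) - len(cleaned), "duplicate row(s) removed")
```

Passing fewer columns makes the comparison looser, in the same way that ticking fewer checkboxes does in the Excel dialog.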

Conclusion

In this article, we have seen how Excel is used to perform data analysis operations, which is the basic step in solving any data science problem. Excel is a powerful tool for basic analysis operations on a dataset and helps us draw insights from it. It has been in use for decades and offers many functions for working with numerical data: we can generate pivot tables, create charts, remove duplicate values and much more. Check out this excel data science course on Financial Modeling to get a hands-on learning experience.

Frequently Asked Questions (FAQs)

1. Can Excel be used for Data Science?

Yes, Excel can help us learn basic insights about the dataset we are considering. It is not the ultimate solution provider, but its insights are a good starting point. Excel works well only with a small amount of data; if the data is too large for Excel, you need to work with higher-end technologies like Big Data, Hadoop, Cloud etc.

2. How do you do Data Science in Excel?

In Excel, we apply functions like sum, count, min, max, average and many others to summarize and check the data, much as we would in the data cleaning step of a data science problem.

3. Should I learn Excel for Data Analysis?

It is not mandatory, but it is always advisable, because Excel lets us gain insights about the given dataset before we look for a solution to the data science problem.

4. What is the best way to learn Excel for data analysis?

Start with small numerical datasets, try out the various functions, and then move on to larger datasets.

Profile

Harsha Vardhan Garlapati

Blog Writer at KnowledgeHut

Harsha Vardhan Garlapati is a Data Science Enthusiast who loves working with data to draw meaningful insights from it and to apply those results to business growth. He is a final year undergraduate student and passionate about Data Science. He is a smart worker, passionate learner, an Ice-Breaker and loves to participate in Hackathons to work on real time projects. He is a Toastmaster Member at S.R.K.R Toastmasters Club, a Public Speaker, a good Innovator and problem solver.