10 Mandatory Skills to Become a Data Scientist

The data science industry is growing at a rapid pace, generating revenue of $3.03 billion in India alone. Even a 10% increase in data accessibility is said to result in over $65 million of additional net income for a typical Fortune 1000 company. Data scientist has been ranked the best job in the US for the 4th year in a row, with an average salary of $108,000, and the demand for data scientists only seems to be growing.

Who is a Data scientist? 

A data scientist is someone who collects the massive amounts of data available online, organizes unstructured formats into bite-sized, readable content, and analyses it to extract vital information about customer trends, thinking patterns, and behavior. This information is then used to create business goals or agendas that are aligned with the end user's or customer's needs.

In other words, a data scientist is someone with sound technical knowledge, good interpersonal skills, strong business acumen and, most importantly, a genuine passion for data. Listed below are some mandatory skills that an aspiring data scientist must develop.

10 Mandatory Skills to Become a Data Scientist 

Technical Skills  

  • Programming, Packages, and Software 

Since the first task of data scientists is to gather raw data and transform it into actionable insights, they need advanced knowledge of coding and statistical data processing. Common languages and tools used by data scientists include Python, R, SQL, Java, and Scala, along with NoSQL databases, Hadoop, and many more.
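For instance, a first pass over raw data in Python might look like the hedged sketch below; the tiny sales table and its column names are invented purely for illustration.

```python
import pandas as pd

# A tiny, made-up sales table standing in for raw data pulled from a source system.
df = pd.DataFrame({
    "region": ["North", "South", "North", "East"],
    "revenue": [1200.0, 950.0, None, 1430.0],
})

print(df.describe())                            # summary statistics for numeric columns
print(df.isna().sum())                          # missing values per column
print(df.groupby("region")["revenue"].mean())   # a simple aggregate insight
```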

  • Machine Learning and Deep Learning 

Machine Learning and Deep Learning are subsets of Artificial Intelligence (AI). Data science largely overlaps with the growing field of AI, as data scientists use their skills to clean, prepare, and extract the data that powers AI applications. Machine learning covers supervised, unsupervised, and reinforcement learning, while deep learning uses layered neural networks to learn directly from existing data. Good examples are the facial recognition feature in photo apps, doodling games like Quick, Draw!, and more.
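As a small, hedged illustration of supervised machine learning in Python, the sketch below fits a simple classifier on scikit-learn's built-in iris dataset; the choice of logistic regression and the 75:25 split are arbitrary examples, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000)   # a simple supervised learner
model.fit(X_train, y_train)                 # learn from labelled examples
print(model.score(X_test, y_test))          # accuracy on unseen data
```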

  • NLP, Cloud Computing and others 

Natural Language Processing (NLP) is a branch of AI that takes the language used by human beings, processes it, and learns to respond accordingly. Several apps and voice-assisted devices like Alexa and Siri already use this remarkable capability. As data scientists work with large amounts of data stored in the cloud, familiarity with cloud platforms like AWS, Azure, and Google Cloud will be beneficial. Adopting practices like DevOps can help data scientists streamline their work, along with several other emerging technologies.
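As a small, hedged taste of NLP-style text processing in Python, the sketch below turns a few made-up utterances into numeric TF-IDF features with scikit-learn; it is only an illustration of text vectorization, not how assistants like Alexa or Siri work internally.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Alexa, play some music",
    "Siri, what is the weather today",
    "Play the weather report",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)      # documents -> TF-IDF feature matrix
print(vectorizer.get_feature_names_out())   # the learned vocabulary
print(tfidf.shape)                          # (3 documents, number of vocabulary terms)
```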

  • Database knowledge, management, and visualization 

A database is a collection of information organized so that the data can be easily accessed, managed, and updated. Data scientists must have strong database knowledge and use the different types to their advantage. Examples include relational (SQL) databases, distributed databases, cloud databases, and many more. Once this expertise is established, data analysis, database management, and data visualization are also important skills.
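As a minimal, hedged sketch of querying a relational database from Python, the example below uses the standard-library sqlite3 module with an invented customers table; real projects would point at an actual database instead of an in-memory one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")                      # throwaway in-memory database
conn.execute("CREATE TABLE customers (name TEXT, spend REAL)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [("Asha", 120.0), ("Ben", 80.5), ("Chen", 210.0)])

# A simple aggregate query of the kind a data scientist might run before analysis.
for row in conn.execute("SELECT COUNT(*), AVG(spend) FROM customers"):
    print(row)                                          # (3, 136.83...)
conn.close()
```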

  • Domain knowledge  

Knowledge of the domain in which the data is to be analyzed and predictions will be made is important. One can harness the true power and fullest potential of an algorithm and its data only with specific domain knowledge. Rather than waiting for the analysis to suggest a direction, the goals themselves can be shaped towards actionable results with the help of domain knowledge.

Non-technical Skills 

  • Communication skills 

As explained above, once the raw data is processed, it needs to be presented in an understandable way. The job is not limited to producing visually coherent information; it also involves communicating the insights behind those visual representations. The data scientist should be excellent at communicating results to the marketing team, sales team, business leaders, and other stakeholders.

  • Team player 

This is related to the previous point. Along with effective communication skills, data scientists need to be good team players, accommodating feedback and other inputs from business teams. They should also be able to communicate their requirements efficiently to the data engineers, data analysts, and other members of the team. Coordinating with team members yields faster results and better outputs.

  • Business acumen  

Since the job of the data scientist ultimately boils down to improving and growing the business, they need to think from a business perspective while framing their data problems. They should have in-depth knowledge of their industry and of their company's existing business problems, and be able to forecast potential problems and their solutions.

  • Critical thinking 

Apart from finding insights, data scientists need to align these results with the business. They need to be able to frame appropriate questions, and the steps or solutions required to solve business problems. This objective ability to analyze data and address a problem from multiple angles is crucial in a data scientist.

  • Intellectual curiosity  

According to Harvard Business Review, data scientists spend 80% of their time discovering and preparing data. For this, they must stay a step ahead and keep up with the latest trends. Constant upskilling and a curiosity to learn quicker ways of solving existing problems can take data scientists a long way in their careers.

Taking data-driven decisions 

Data science is indisputably one of the leading industries today. Whether you come from a technical field or a non-technical background, there are several ways to build the skills needed to become a data scientist. From online courses to bootcamps, aspirants should stay a step ahead in this competitive field and build up their data work portfolios. Reading up on the latest technologies and regularly experimenting with new trends is the way forward.

KnowledgeHut

Author

KnowledgeHut is an outcome-focused global ed-tech company. We help organizations and professionals unlock excellence through skills development. We offer training solutions under the people and process, data science, full-stack development, cybersecurity, future technologies and digital transformation verticals.
Website : https://www.knowledgehut.com


Suggested Blogs

The Role of Mathematics in Machine Learning

Introduction

Automation and machine learning have changed our lives. From the most technologically savvy person working in leading digital platform companies like Google or Facebook to someone who is just a smartphone user, there are very few who have not been impacted by artificial intelligence or machine learning in some form or the other: through social media, smart banking, healthcare or even Uber. From self-driving cars, robots, image recognition, diagnostic assessments, recommendation engines, photo tagging, fraud detection and more, the future for machine learning and AI is bright and full of untapped possibilities.

With the promise of so much innovation and path-breaking ideas, any person remotely interested in futuristic technology may aspire to make a career in machine learning. But how can you, as a beginner, learn about the latest technologies and the various diverse fields that contribute to them? You may have heard of many cool-sounding job profiles like Data Scientist, Data Analyst, Data Engineer or Machine Learning Engineer, which are not just rewarding monetarily but also allow one to grow as a developer and creator and work at some of the most prolific technology companies of our times. But how do you get started if you want to embark on a career in machine learning? What educational background should you pursue, and what skills do you need to learn?

Machine learning is a field that encompasses probability, statistics, computer science and algorithms that are used to create intelligent applications. These applications have the capability to glean useful and insightful information from data and arrive at business insights. Since machine learning is all about the study and use of algorithms, it is important that you have a base in mathematics.

Why do I need to Learn Math?

Math has become part of our day-to-day life. From the time we wake up to the time we go to bed, we use math in every aspect of our life. But you may wonder about the importance of math in machine learning, and whether and how it can be used to solve any real-world business problems.

Whatever your goal, whether it is to be a Data Scientist, Data Analyst, or Machine Learning Engineer, your primary area of focus should be mathematics. Math is the basic building block for solving business and data-driven applications in real-world scenarios. From analyzing company transactions to understanding how to grow in the day-to-day market, from making future stock predictions to forecasting sales, math is used in almost every area of business. The applications of math are used in many industries like Retail, Manufacturing and IT to bring out the company overview in terms of sales, production, goods intake, wages paid, prediction of their position in the present market and much more.

Pillars of Machine Learning

To get a head start and familiarize ourselves with the latest technologies like machine learning, data science, and artificial intelligence, we have to understand the basic concepts of math, write our own algorithms and implement existing algorithms to solve many real-world problems. There are four pillars of Machine Learning, on which most real-world business problems are solved and many Machine Learning algorithms are written. They are:

  • Statistics
  • Probability
  • Calculus
  • Linear Algebra

Machine learning is all about dealing with data.
We collect data from organizations or from repositories like Kaggle, UCI etc., and perform various operations on the dataset like cleaning and processing the data, visualizing it and predicting its output. For all the operations we perform on data, there is one common foundation that helps us achieve all of this through computation, and that is math.

STATISTICS

Statistics is used in drawing conclusions from data. It deals with the statistical methods of collecting, presenting, analyzing and interpreting numerical data. Statistics plays an important role in the field of Machine Learning as it deals with large amounts of data and is a key factor behind the growth and development of an organization.

  • Collection of data is possible from censuses, samples, primary or secondary data sources and more. This stage helps us to identify our goals in order to work on further steps.
  • The data that is collected contains noise, improper data, null values, outliers etc. We need to clean the data and transform it into meaningful observations.
  • The data should be represented in a suitable and concise manner. This is one of the most crucial steps, as it helps us understand the insights and is used as the foundation for further analysis of the data.
  • Analysis of data includes condensation, summarization, conclusions etc., by means of central tendencies, dispersion, skewness, kurtosis, correlation, regression and other methods.
  • The interpretation step includes drawing conclusions from the data collected, as the figures don't speak for themselves.

Statistics used in Machine Learning is broadly divided into two categories, based on the type of analysis performed on the data: Descriptive Statistics and Inferential Statistics.

a) Descriptive Statistics

  • Concerned with describing and summarizing the target population.
  • Works on a small dataset.
  • The end results are shown in the form of pictorial representations.
  • The tools used in descriptive statistics are Mean, Median and Mode, which are measures of central tendency, and Range, Standard Deviation, Variance etc., which are measures of variability.

b) Inferential Statistics

  • Methods of making decisions or predictions about a population based on sample information.
  • Works on a large dataset.
  • Compares, tests and predicts future outcomes.
  • The end results are shown as probability scores.
  • The specialty of inferential statistics is that it makes conclusions about the population beyond the data available.
  • Hypothesis tests, sampling distributions, Analysis of Variance (ANOVA) etc., are the tools used in inferential statistics.

Statistics plays a crucial role in Machine Learning algorithms. The role of a Data Analyst in the industry is to draw conclusions from data, and for this he or she requires statistics and depends on it.

PROBABILITY

Probability denotes the likelihood of a certain event occurring, based on past experience. In the field of Machine Learning, it is used in predicting the likelihood of future events. The probability of an event is calculated as

P(Event) = Favorable Outcomes / Total Number of Possible Outcomes

In probability, an event is a set of outcomes of an experiment, and P(E) represents the probability of event E occurring. The probability of any event lies between 0 and 1. A situation in which the event E might or might not occur is called a trial.

Some of the basic concepts required in probability are as follows:

Joint Probability: P(A ∩ B) = P(A) × P(B). This holds only when the events A and B are independent of each other.

Conditional Probability: the probability of event A happening when it is known that another event B has already happened, denoted by P(A|B), i.e., P(A|B) = P(A ∩ B) / P(B)

Bayes theorem: the application of probability theory to estimating unknown probabilities and making decisions on the basis of new sample information. It is useful in solving business problems in the presence of additional information. The reason behind the popularity of this theorem is its usefulness in revising a set of old probabilities (prior probabilities) with some additional information to derive a set of new probabilities (posterior probabilities):

P(A|B) = P(B|A) × P(A) / P(B)

From this equation it is inferred that Bayes theorem explains the relationship between the conditional probabilities of events. The theorem works mainly on uncertain samples of data and is helpful in determining the specificity and sensitivity of the data. It also plays an important role in drawing the confusion matrix.

A confusion matrix is a table-like structure that measures the performance of the Machine Learning models or algorithms we develop. It is helpful in determining the true positive rate, true negative rate, false positive rate, false negative rate, precision, recall, F1-score, accuracy and specificity, and in drawing the ROC curve from the given data.

We need to further focus on probability distributions, which are classified as discrete and continuous, likelihood estimation functions, and so on. In Machine Learning, the Naive Bayes algorithm works in a probabilistic way, with the assumption that input features are independent.

Probability is an important area in most business applications, as it helps in predicting future outcomes from the data and deciding on further steps. Data Scientists, Data Analysts and Machine Learning Engineers use the concept of probability very often, as their job is to take inputs and predict the possible outcomes.

CALCULUS

Calculus is the branch of mathematics that studies rates of change of quantities. It deals with optimizing the performance of machine learning models or algorithms. Without understanding calculus, it is difficult to compute probabilities on the data and we cannot draw the possible outcomes from the data we take. Calculus is mainly focused on integrals, limits, derivatives and functions, and is divided into two types: Differential Calculus and Integral Calculus. It is used in backpropagation algorithms to train deep neural networks.

  • Differential calculus splits the given data into small pieces to know how it changes.
  • Integral calculus combines (joins) the small pieces to find how much there is in total.

Calculus is mainly used in optimizing Machine Learning and Deep Learning algorithms and in developing fast and efficient solutions. The concept of calculus is used in algorithms like Gradient Descent and Stochastic Gradient Descent (SGD), and in optimizers like Adam, RMSprop, Adadelta etc.

Data Scientists mainly use calculus in building many Deep Learning and Machine Learning models. They are involved in optimizing the data and producing better outputs by drawing out the intelligent insights hidden in it.
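To make this concrete, here is a minimal, self-contained gradient descent sketch in plain Python; the quadratic loss, learning rate and iteration count are illustrative assumptions rather than anything prescribed here.

```python
# Minimal gradient descent on f(w) = (w - 3)**2, whose derivative is 2 * (w - 3).
def gradient_descent(learning_rate=0.1, n_iterations=50):
    w = 0.0                             # initial guess
    for _ in range(n_iterations):
        gradient = 2 * (w - 3)          # derivative of the loss at the current w
        w -= learning_rate * gradient   # step against the gradient
    return w

print(gradient_descent())               # converges towards the minimum at w = 3
```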
Linear Algebra

Linear algebra focuses more on computation. It plays a crucial role in understanding the background theory behind machine learning and is also used for deep learning. It gives us better insights into how the algorithms really work in day-to-day life, and enables us to take better decisions. It mostly deals with vectors and matrices.

  • A scalar is a single number.
  • A vector is an array of numbers represented in a row or column, and it has only a single index for accessing it (i.e., either rows or columns).
  • A matrix is a 2-D array of numbers and can be accessed with the help of both indices (i.e., by both rows and columns).
  • A tensor is an array of numbers placed in a grid in a particular order, with a variable number of axes.

The NumPy package in the Python library is used for the computation of all these numerical operations on the data. The NumPy library carries out the basic operations, like addition, subtraction, multiplication and division of vectors and matrices, and produces a meaningful value at the end. NumPy represents the data in the form of N-dimensional arrays.

Machine learning models could not be developed, complex data structures could not be manipulated, and operations on matrices could not be performed without linear algebra; all the results of the models are displayed using linear algebra as a platform. Some machine learning algorithms like linear and logistic regression, SVM and decision trees use linear algebra in building the algorithms, and with the help of linear algebra we can build our own ML algorithms. Data Scientists and Machine Learning Engineers work with linear algebra when building their own algorithms from data.

How do Python functions correlate to Mathematical Functions?

So far, we have seen the importance of mathematics in machine learning. But how do mathematical functions correlate to Python functions when building a machine learning algorithm? The answer is quite simple. In Python, we take the data from our dataset and apply many functions to it. The data can be of different forms, like characters, strings, numerical values, float values, double values, Boolean values, special characters, garbage values etc., in the dataset that we take to solve a particular machine learning problem. But the computer understands only zeroes and ones; whatever we take as input to our machine learning model from the dataset, the computer understands as binary zeroes and ones only.

Here Python libraries like NumPy, SciPy and Pandas mostly provide pre-defined functions. These help us apply mathematical functions to get better insights into the data from the dataset that we take. They help us work with different types of data, processing them and extracting information from them. Those functions further help us in cleaning the garbage values, the noise and the null values present in the data, and finally help make the dataset free from all the unwanted matter present in it. Once the data is preprocessed with these Python functions, we can apply our algorithms to the dataset to know which model works better for the data, and we can find the accuracies of the different algorithms applied to our dataset. The mathematical functions help us visualize the content present in the dataset, and help us get a better understanding of the data and of the problem we are addressing with a machine learning algorithm. Every algorithm that we use to build a machine learning model has math functions hidden in it, in the form of Python code.
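As a small, hedged illustration of the NumPy operations mentioned above, the sketch below builds a vector and a matrix and applies a few basic linear-algebra operations; the values are made up for the example.

```python
import numpy as np

vector = np.array([1, 2, 3])                  # a 1-D array (vector)
matrix = np.array([[1, 0], [0, 1], [2, 3]])   # a 3x2 matrix

print(vector + 10)              # element-wise addition
print(matrix.T)                 # transpose of the matrix
print(matrix.T @ matrix)        # matrix multiplication: (2x3) @ (3x2) -> (2x2)
print(np.dot(vector, vector))   # dot product of the vector with itself
```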
The algorithms that we develop can be used to solve a variety of problems, from a Boolean problem to a matrix problem like identifying an image in a crowd of people, and much more. The final stage is to find the algorithm that best suits the model, and this is where the mathematical functions available in Python help us. They let us analyze which algorithm is best through comparison measures like correlation, F1-score, accuracy, specificity and sensitivity. Mathematical functions also help us find out whether the selected model is overfitting or underfitting the data.

To conclude, we cannot apply mathematical functions directly in building machine learning models, so we need a language to implement the mathematical strategies in the algorithm. This is why we use Python to implement our math models and draw better insights from the data. Python is a suitable language for implementations of this type, and is considered to be among the best languages for solving real-world problems and implementing new techniques and strategies in the field of ML and Data Science.

Conclusion

For machine learning enthusiasts and aspirants, mathematics is a crucial aspect to focus on, and it is important to build a strong foundation in math. Every concept you learn in machine learning, and every small algorithm you write or implement to solve a problem, directly or indirectly relates to mathematics.

The concepts of math that are implemented in machine learning are built upon the basic math that we learn in the 11th and 12th grades. At that stage the knowledge is theoretical, but in machine learning we experience the practical use cases of the math we studied earlier. The best way to get familiar with the concepts of mathematics is to take a machine learning algorithm, find a use case, and solve and understand the math behind it.

An understanding of math is paramount to enable us to come up with machine learning solutions to real-world problems. A thorough knowledge of math concepts also helps us enhance our problem-solving skills.

What Is Data Splitting in Learn and Test Data?

Data is the fuel of every machine learning algorithm; it is what statistical inferences are made from and what predictions are based on. Consequently, it is important to collect the data, clean it and use it with maximum efficacy. Decent data sampling can guarantee accurate predictions and drive the whole ML project forward, whereas bad data sampling can lead to incorrect predictions.

Before diving into the sampling techniques, let us understand what a population is and how it differs from a sample. The population is the assortment or collection of components which share some characteristic for all intents and purposes, and the total number of observations is said to be the size of the population. A sample is a subset of the population. The process of choosing a sample from a given population is known as sampling, and the number of components in the sample is the sample size.

Data sampling refers to statistical approaches for picking observations from the domain to estimate a population parameter, whereas data resampling refers to drawing repeated samples from the main or original source of data. Resampling is a non-parametric procedure of statistical extrapolation; it produces unique sample distributions based on the original data and is used to improve accuracy and intuitively measure the uncertainty of the population parameter.

Sampling methods can be divided into two parts:

  • Probability sampling procedure
  • Non-probability sampling procedure

The distinction between the two is whether the selection depends on randomization. With randomization, every component gets an equal opportunity to be part of the sample for the study.

Probability sampling – a method in which each element of a given population has an equal chance of being selected.

  • Simple random sampling – for instance, a classroom has 100 students and each student has an equal chance of getting selected as the class representative.
  • Systematic sampling – a technique in which the first element is selected at random and the others are selected based on a fixed sampling interval. For instance, consider a population of size 20 (1, 2, 3, …, 19, 20). Suppose the first element chosen is number 3 and the sample size is 5. The sampling interval is 20/5 = 4, so the selections are 3, 3 + 4 = 7, then 11, and so on.
  • Stratified sampling – the total group is subdivided into smaller groups, known as strata, to obtain a sampling process. Assume we need to identify the average number of votes in three different cities to elect a representative. City x has 1 million citizens, city y has 2 million citizens, and city z has 3 million citizens. We could randomly choose a sample of size 60 from the entire population, but such random samples would not be balanced with respect to the different cities, so there could be an estimation error. To overcome this, we may choose random samples of 10, 20 and 30 from cities x, y and z respectively, and thereby minimize the total estimation error.
  • Reservoir sampling – a randomized algorithm used to select k out of n samples, where n is generally very large or unknown. For instance, reservoir sampling can be used to obtain a sample of k fish from the unknown number of fish in a lake.
  • Cluster sampling – samples are taken as subgroups/clusters of the population, and these subgroups are selected at random.
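As a hedged illustration of the reservoir sampling idea described above (often called Algorithm R), the sketch below keeps k items, uniformly at random, from a stream whose length is not known in advance; the stream of 10,000 numbers is just a stand-in for, say, fish in a lake.

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)          # fill the reservoir first
        else:
            j = random.randint(0, i)        # pick a slot in [0, i]
            if j < k:
                reservoir[j] = item         # replace with decreasing probability
    return reservoir

print(reservoir_sample(range(1, 10001), k=5))   # 5 items sampled from 10,000
```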
Non-probability sampling – in a non-probability sampling method, each instance of the population does not have an equal chance of being selected. There is a risk of ending up with a non-representative sample which might not bring out a comprehensive outcome.

  • Convenience sampling – this technique includes people or samples that are easy to reach. Though it is the easiest way to collect a sample, it runs a high risk of not being representative of the population. For instance, in a population of size 20 (1, 2, 3, …, 19, 20), the surveyor might want persons 4, 7, 11 and 18 to participate, which can create selection bias.
  • Quota sampling – the samples or instances are chosen based on traits or characteristics that match the population. For instance, in a population of size 20, consider a quota of the multiples of 4: (4, 8, 12, 16, 20).
  • Judgement sampling – also known as selective sampling; participants are chosen based on the researcher's judgement.
  • Snowball sampling – an individual element/person can nominate further elements/people known to them, for example A nominates P, P nominates G, G nominates M (A > P > G > M). It is applicable only when the sampling frame is difficult to identify.

Non-probability sampling techniques may lead to selection bias and population misrepresentation.

We often come across the case of an imbalanced dataset. Resampling is a technique used to deal with imbalanced datasets; it includes removing samples from the majority class (undersampling) and adding more instances from the minority class (oversampling). There is a dedicated Python library to tackle imbalanced datasets, known as imblearn, which has multiple methods to handle undersampling and oversampling.

Tomek links for undersampling – pairs of close instances from opposite classes. The majority-class elements of the Tomek links are eliminated, which intuitively gives the ML classifier a better decision boundary.

SMOTE for oversampling – the Synthetic Minority Oversampling Technique works by generating new examples from the minority class. It is a statistical technique for increasing the number of instances in the dataset in a more balanced manner:

  • Pick a minority-class sample as the input vector.
  • Find its k nearest neighbours (k_neighbors is passed as an argument to SMOTE()).
  • Pick one of these neighbours and place a synthetic point anywhere on the line joining the point in question and its chosen neighbour.
  • Repeat the above steps until the classes are balanced.

Other sampling methods worth reading about are NearMiss and cluster centroids for undersampling, and ADASYN and borderline SMOTE (bSMOTE) for oversampling.

Train-Test split

Python is bundled with powerful ML libraries. The train_test_split() function from the Scikit-Learn library is one of the major Python utilities for splitting a dataset, typically at random, into training and validation subsets. The parameter train_size takes a fraction between zero and one specifying the training size; the remaining samples in the original dataset are kept for testing. The records selected for the training and test sets are randomly sampled. The ready-made train_test_split() and a hand-rolled split_train_test() helper do more or less the same thing.
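Here is a minimal, hedged sketch of train_test_split() on a toy dataset; the 80:20 ratio and the random_state value are illustrative choices, not requirements.

```python
from sklearn.model_selection import train_test_split

X = [[i] for i in range(10)]          # 10 samples, one feature each
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]    # binary target

# Hold out 20% of the rows for testing; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(len(X_train), len(X_test))      # 8 2
```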
  • Train set – the subset of the dataset used to train the model.
  • Test set – the subset of the dataset used to test the trained model.

The train-test method is used to measure the performance of ML algorithms, and it is appropriate when the dataset is very large. A train-test split can be implemented for any supervised machine learning algorithm. It involves taking the dataset as a whole and subdividing it into two subsets: the training dataset is used to fit the model, while the test dataset serves as input to the model; the model's predictions are made on the test data and the output (prediction) is compared to the expected values. The ultimate objective is to evaluate the performance of the ML model against new or unseen data.

It is important that the test data adheres to the following conditions:

  • It is large enough to yield statistically significant results.
  • It is representative of the whole dataset; one must not pick a test set with traits/characteristics different from the training set.
  • Never train on test data – don't be fooled by good results and high accuracy; it might be the case that the model was accidentally trained on the test data.

train_test_split() comes with additional features: the random_state parameter acts as a random seed, which makes it reproducible which samples go to the training set and which go to the test set; and it accepts multiple datasets with a matching number of rows and splits them on the same indices. train_test_split returns four variables:

  • train_X – the X features of the training set.
  • train_y – the values of the response variable for the training set.
  • test_X – the X features of the test set.
  • test_y – the values of the response variable for the test set.

There is no exact rule to split the data 80:20 or 70:30; it depends on the data and the target variable. Some data scientists use a range of 60% to 80% for training and the rest for testing the model. To find the number of records in each split, we use Python's len function: len(X_train), len(X_test). The model is built using the training set and is tested using the test set: X_train and y_train contain the independent features and the response variable values for the training dataset, while X_test and y_test contain the independent features and the response variable values for the test dataset.

Conclusion

Sampling is an ongoing process of accumulating information or observations to estimate a population variable. We learnt about the two sampling types, the probability sampling procedure and the non-probability sampling procedure, and that resampling is a repeated process of drawing samples from the main data source. Finally, we learnt about training, testing and splitting the data, which are used to measure the performance of a model. The training and testing of the model are done to understand data discrepancies and develop a better understanding of the machine learning model.

Data Preparation for Machine Learning Projects

The data we collect for machine learning must be pre-processed before it can be used to fit a model. Data preparation is essentially the task of modifying raw data into a form that can be used for modelling, mostly through data addition, deletion or other data transformation techniques. We need to pre-process the data before feeding it into any algorithm mainly for the following reasons:

  • Messy data – real-world data is messy, with missing values, redundant values, out-of-range values, errors and noise.
  • Machine learning algorithms need numeric data.
  • More often than not, algorithms have requirements on the input data; for example, some algorithms assume a certain probability distribution of the data, others might perform worse if the predictor variables are highly correlated, and so on.

Data preparation tasks are mostly dependent on the dataset we are working with, and to some extent on the choice of model. This becomes more evident after initial analysis of the data and EDA: looking at the summary statistics, we know whether predictors need to be scaled; looking at the correlation matrix, we can find out whether there are highly correlated predictors; looking at various plots, e.g. a boxplot, we can find whether outliers need to be dealt with; and so on. Even though every dataset is different, we can define a few common steps which can guide us in preparing the data to feed into our learning algorithms. Some common tasks that contribute to data pre-processing are:

  • Data Cleaning
  • Feature Selection
  • Data Transformation
  • Feature Engineering
  • Dimensionality Reduction

Note: Throughout this article, we will refer to Python libraries and syntaxes.

Data Cleaning

Data cleaning can be summed up as the process of correcting the errors in the data. Errors could be in the form of missing values, redundant rows or columns, variables with zero or near-zero variance, and so on. Thus, data cleaning involves a few or all of the below sub-tasks:

  • Redundant samples or duplicate rows should be identified and dropped from the dataset. In Python, Pandas functions such as duplicated() can be used to identify such samples, and drop_duplicates() can be used to drop the rows.
  • Redundant features: if the dataset has features which are highly correlated, it may lead to multi-collinearity (irregular regression coefficient estimates). Such columns can be identified using the correlation matrix, and one feature of each highly correlated pair should be dropped. Similarly, near-zero-variance features, which have the same value for all the samples, do not contribute to the variance in the data; such columns should be identified and dropped from the dataset.
  • Outlier detection: outliers are extreme values which fall far away from the other observations. Outliers can skew the descriptive statistics of the data, mislead data interpretations and negatively impact model performance, so it is important that they are detected and dealt with. Outliers can be detected through data visualization techniques like box plots and scatter plots, or by computing z-scores or the inter-quartile range. When using the z-score, a data point which is more than 3 standard deviations away from the mean is normally considered an outlier, although this may vary based on the size of the dataset.
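As a hedged sketch of the duplicate-removal and z-score checks described above, the example below works on small made-up data; the column names, the injected outlier of 500 and the threshold of 3 are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Duplicate rows: identify and drop them.
df = pd.DataFrame({"id": [1, 2, 2, 3], "value": [10.0, 12.5, 12.5, 11.2]})
print(df.duplicated().sum())       # 1 duplicate row
df = df.drop_duplicates()

# Z-score rule on a larger numeric column: flag points > 3 std devs from the mean.
rng = np.random.default_rng(0)
s = pd.Series(np.append(rng.normal(loc=50, scale=5, size=200), 500.0))
z = (s - s.mean()) / s.std()
print(s[z.abs() > 3])              # the injected 500.0 is flagged as an outlier
```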
When using the inter-quartile range, a point which is below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR is considered to be an outlier, where Q1 is the first quartile, Q3 is the third quartile and IQR is the inter-quartile range.

If there are only a few outliers, you may choose to drop the samples containing them; if there are too many, they can be modelled separately. We may also choose to cap or floor the outlier values at the 95th or 5th percentile value, or pick an appropriate replacement value by analyzing the deciles of the data.

Missing values: data with missing values cannot be used for modelling, hence any missing values should be identified and cleaned. If the data in the predictor or sample is sparse, we may choose to drop the entire column/row; otherwise we may impute the missing value with the mean or median. Missing values in categorical variables can be replaced with the most frequent class.

Points to remember: use the z-score for outlier detection if the data follows a Gaussian distribution, else use the inter-quartile range.

Feature Selection

Sometimes datasets have hundreds of input variables, not all of which are good predictors of the target; some may only contribute noise to the data. Feature selection techniques are used to find the input variables that can most efficiently predict the target variable, in order to reduce the number of input variables. Feature selection techniques can be classified as supervised and unsupervised selection techniques. As the name suggests, unsupervised selection techniques do not consider the target variable while eliminating input variables; this includes techniques like using correlation to eliminate highly correlated predictors, or eliminating low-variance predictors. Supervised feature selection techniques consider the target variable when selecting the features to be eliminated, and can be further divided into three groups: intrinsic, filter and wrapper techniques.

Intrinsic – the feature selection process is embedded in the model building process itself, e.g. tree-based algorithms, which pick the best predictor for each split. Similarly, regularization techniques like lasso shrink the coefficients of the predictors, such that some coefficients are shrunk to zero and those predictors are excluded from the model. Multivariate adaptive regression spline (MARS) models also fall under this category. A major advantage of such methods is that, since the feature selection is part of the model building process, it is relatively fast. However, model dependence can also be a disadvantage; for example, some tree-based algorithms are greedy and hence may select predictors which lead to a sub-optimal fit.

Filter – filter-based selection techniques use a statistical method to score each predictor separately against the target variable and choose the predictors with the highest scores. It is mostly univariate analysis, i.e., each predictor is evaluated in isolation, and it does not consider the correlation of the independent variables amongst themselves. Based on the type of the input variable (numerical or categorical) and the type of the output variable, an appropriate statistical measure can be used to evaluate predictors for feature selection: for example, Pearson's correlation coefficient, Spearman's correlation coefficient, ANOVA or chi-square.
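As a hedged sketch of filter-style selection, the example below uses scikit-learn's SelectKBest with the ANOVA F-statistic on a synthetic classification set; keeping k = 2 features is an arbitrary choice for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Toy data: 6 input features, of which only a few are informative.
X, y = make_classification(n_samples=200, n_features=6, n_informative=3,
                           random_state=0)

selector = SelectKBest(score_func=f_classif, k=2)   # keep the 2 best-scoring features
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)          # (200, 2)
print(selector.get_support())    # boolean mask of the kept columns
```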
Wrapper – wrapper feature selection builds models using various subsets of predictors iteratively and evaluates each model, until it finds the subset of features which best predicts the target. These methods are agnostic to the type of variables, but they are computationally more taxing. RFE is a commonly used wrapper-based feature selection method. Recursive Feature Elimination is a greedy backward-elimination technique which starts with the complete set of predictors and systematically eliminates the less useful ones, until it finds the subset that best predicts the target variable with the specified number of predictors. Two important hyperparameters for the RFE algorithm in scikit-learn are the number of predictors (n_features_to_select) and the algorithm of choice (estimator).

Points to remember: feature selection techniques reduce the number of features by excluding or eliminating existing features from the dataset, whereas dimensionality reduction techniques create a projection of the data in a lower-dimensional feature space which does not have a one-to-one mapping with the existing features. Both, however, have the similar goal of reducing the number of independent variables.

Data Transformations

We may need to transform data to change its data type, scale or distribution.

Type: we need to analyze the input variables at the very beginning to understand whether the predictors are represented with the appropriate data type, and do the required conversions before progressing with EDA and modelling. For example, sometimes Boolean values are encoded as true and false, and we may transform them to take the values 0 and 1. Similarly, we may come across integer variables where it is more appropriate to treat them as categorical; for example, in a dataset for predicting car prices, it would be more appropriate to treat the variable 'Number of doors', which takes the values {2, 4}, as a categorical variable. Categorical variables should be converted to numeric before they can be used for modelling. There are many categorical-variable encoding techniques, such as N−1 dummy encoding, one-hot encoding, label encoding and frequency encoding. Ordinal encoding can be used when we want to specify and maintain the order of an ordinal variable.

Scale: predictor variables may have different units (km, $, years etc.) and hence different scales. For example, a dataset might have input variables like age and salary; the scale of salary will always be much higher than that of age, and hence it may contribute unequally to the model and create a bias. We therefore transform the predictors to bring them to a common scale, and normalization and standardization are the most widely used scaling techniques.

  • Normalization scales the data such that all values lie between 0 and 1 (the scikit-learn method even allows one to specify a preferred range).
  • Standardization centres the data around the mean and then scales it by the standard deviation; in other words, the mean of the variable is subtracted from each value and the difference is divided by the standard deviation. The resulting data has zero mean and a standard deviation of 1. Standardization assumes that the data follows a Gaussian distribution.

The scikit-learn library in Python can be used for normalization (MinMaxScaler()) and standardization (StandardScaler()).
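A minimal, hedged sketch contrasting the two scalers on a single made-up column; the values stand in for something like an 'age' variable.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[20.0], [30.0], [40.0], [50.0]])    # e.g. an 'age' column

print(MinMaxScaler().fit_transform(X).ravel())    # scaled to the [0, 1] range
print(StandardScaler().fit_transform(X).ravel())  # zero mean, unit variance
```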
Distribution: many algorithms assume a Gaussian distribution for the underlying data. If the data is not Gaussian, or is only Gaussian-like, we can transform it to reduce the skewness. The Box-Cox transform or the Yeo-Johnson transform can be used to perform power transformations on the data. The Box-Cox transform applies a different transformation based on the value of lambda: for lambda = −1 it is an inverse transformation, for lambda = 0 a log transformation, for lambda = 0.5 a square-root transformation, and for lambda = −0.5 a reciprocal square-root transformation. The PowerTransformer() class in the Python scikit-learn library can be used to make these power transformations.

Points to remember: data transformations should be fitted on the training dataset, so that the statistics required for the transformation are estimated from the training set only and then applied to the validation set. Decision trees and other tree-based ensembles like random forests and boosting algorithms are not impacted by different scales of the input variables, so scaling may not be required for them. Linear regression and neural networks, which use a weighted sum of the input variables, and K-nearest neighbours or SVM, which compute distances or dot products between predictors, will be impacted by the scale of the predictors, so input variables should be scaled for these models. Between normalization and standardization, one should standardize when the data follows a Gaussian distribution, and normalize otherwise.

Feature Engineering

Feature engineering is the part of data pre-processing where we derive new features using one or more existing features. For example, when working on a taxi-fare prediction problem, we may derive a new feature, the distance travelled in the ride, from the latitude and longitude co-ordinates of the start and end points of the ride. Or, when predicting sales or footfall for a retail business, we may need to add a new feature to factor in the impact of holidays, weekends and festivals on the target variable. Hence, we may need to engineer such new predictors and feed them into our model to identify the underlying patterns effectively.

  • Polynomial terms: we may add new features by raising the existing input variables to a higher-degree polynomial. Polynomial terms help the model learn non-linear patterns; when polynomial terms of existing features are added to a linear regression model, it is termed polynomial regression. Usually, we stick to a small degree of 2 or 3.
  • Interaction terms: we may add new features that represent the interaction between existing features by adding the product of two features. For example, if we are helping a business allocate its marketing budget between mediums like radio, TV and newspaper, we need to model how effective each medium is; we may like to factor in an interaction term for the radio and newspaper campaigns, to understand the effectiveness of marketing when both campaigns run together at the same time. Similarly, when predicting a crop yield, we may engineer a new interaction term for fertilizer and water together, to factor in how the yield varies when water and fertilizer are provided together.
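As a hedged illustration, scikit-learn's PolynomialFeatures can generate both the polynomial and the interaction terms described above from existing features; the degree of 2 and the feature names x1 and x2 are assumptions for the example, and get_feature_names_out assumes a recent scikit-learn version.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])                      # two existing features, x1 and x2

poly = PolynomialFeatures(degree=2, include_bias=False)
print(poly.fit_transform(X))                    # [[2. 3. 4. 6. 9.]] -> x1, x2, x1^2, x1*x2, x2^2
print(poly.get_feature_names_out(["x1", "x2"])) # names of the generated terms
```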
Points to remember: when using polynomial terms in the model, it is good practice to restrict the degree of the polynomial to 3, or at most 4. This is firstly to control the number of input variables, and secondly because a larger polynomial degree produces large values, which may push the weights (parameters) to become large and hence make the model less sensitive to small changes. Domain knowledge, or the advice of an SME, may come in handy for identifying effective interaction terms.

Dimensionality Reduction

Sometimes data might have hundreds or even thousands of features. High-dimensional data can be more complicated, with many more parameters to train and a very complicated model structure. In higher dimensions, the volume of the space is huge and the data points become sparse, which can negatively impact the performance of a machine learning algorithm; this is sometimes referred to as the curse of dimensionality. Dimensionality reduction techniques are used to reduce the number of predictor variables in the dataset. Some techniques for dimensionality reduction are:

  • PCA, or Principal Component Analysis, uses linear algebra and eigenvalues to achieve dimensionality reduction. For the given data points, PCA finds an orthogonal set of directions that have maximum variance; by rotating the reference frame, it identifies the directions (the ones corresponding to the smallest eigenvalues) which can be neglected.
  • Manifold learning is a non-linear dimensionality reduction technique which uses geometric properties of the data to create low-dimensional projections of high-dimensional data, while preserving its structure and relationships, and to visualize high-dimensional data, which is otherwise difficult. The Self-Organizing Map (SOM), also called a Kohonen map, and t-SNE are examples of manifold learning techniques. t-distributed stochastic neighbour embedding (t-SNE) computes the probability that pairs of data points (in the high-dimensional space) are related and maps them into a low dimension such that the data has a similar distribution.
  • Autoencoders are deep learning neural networks that learn a low-dimensional representation of a given dataset in an unsupervised manner. The hidden layer is limited to contain fewer neurons, so it learns to map a high-dimensional input vector into a low-dimensional vector while still preserving the underlying structure and relationships in the data. Autoencoders have two parts: the encoder, which learns to map the high-dimensional vector to a low-dimensional space, and the decoder, which maps the data from the low dimension back to the high dimension. The output of the encoder, with its reduced dimension, can be fed into any other model for supervised learning.

Points to remember: dimensionality reduction is mostly performed after data cleaning and data scaling. It is imperative that the dimensionality reduction performed on the training dataset is also performed on the validation set and on the new data on which the model will predict.

Conclusion

Data preparation is an important and integral step of machine learning projects. There are multiple techniques for the various data cleaning tasks, but there are no universally best or worst data preparation techniques. Every machine learning problem is unique, and so is the underlying data. We need to apply different techniques and see what works best, based on the data and the problem at hand.