What Is Data Splitting in Learn and Test Data?

Published 05th Sep, 2023

    Data is the fuel of every machine learning algorithm: it is the basis on which statistical inferences are made and predictions are produced. Consequently, it is important to collect the data, clean it, and use it with maximum efficacy. Good sampling yields accurate predictions and drives the whole ML project forward, whereas poor sampling leads to incorrect predictions. Before diving into the sampling techniques, let us understand what a population is and how it differs from a sample. 

    The population is the entire collection of elements that share one or more characteristics of interest. The total number of observations is the size of the population. For more information, check out Data courses online.  


    A sample is a subset of the population. The process of choosing a sample from a given population is known as sampling, and the number of elements in the sample is the sample size. 

    Data sampling refers to statistical approaches for picking observations from the domain in order to estimate a population parameter, whereas data resampling refers to drawing repeated samples from the original source of data. Resampling is a non-parametric procedure of statistical inference: it produces new sample distributions based on the original data and is used to improve accuracy and to measure the uncertainty of a population parameter. 

    Sampling methods can be divided into two parts: 

    1. Probability sampling procedure  
    2. Non-probability sampling procedure  

    The distinction between the two is whether the selection of the sample depends on randomization. With randomization, every element of the population gets an equal chance of being selected for the study. 

    Probability Sampling

    It is a method in which each element of a given population has an equal chance of being selected. 

    • Simple random sampling – every element has an equal chance of being selected. For instance, in a classroom of 100 students, each student has an equal chance of being chosen as the class representative. 
    • Systematic sampling – the first element is selected at random, and the others are selected at a fixed sampling interval. 

    For instance, consider a population of size 20 (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20). 

    Suppose the first selected element is number 3 and the sample size is 5. The sampling interval is 20/5 = 4, so the next selection is 3 + 4 = 7, giving 3, 7, 11, 15, 19. 
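
    The same two techniques in a minimal Python sketch (the variable names are ours, chosen for illustration only):

    ```python
    import random

    population = list(range(1, 21))   # the population 1..20 from the example
    sample_size = 5

    # Simple random sampling: every element has an equal chance of selection.
    simple_random = random.sample(population, k=sample_size)

    # Systematic sampling: pick a random start, then step by a fixed
    # interval of population size / sample size = 20 / 5 = 4.
    interval = len(population) // sample_size
    start = random.randrange(interval)        # index 0..3
    systematic = population[start::interval]  # e.g. 3, 7, 11, 15, 19

    print("simple random:", simple_random)
    print("systematic:   ", systematic)
    ```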

    • Stratified sampling – the total group is subdivided into smaller groups, known as strata, and samples are drawn from each stratum. 

    Assume that we need to identify the average number of votes in three different cities to elect a representative. City x has 1 million citizens, city y has 2 million, and city z has 3 million. We could randomly choose a sample of size 60 from the entire population, but such a random sample would not be balanced across the cities, which could introduce estimation error. To overcome this, we instead choose random samples of 10, 20 and 30 from cities x, y and z respectively, minimizing the total estimation error. 
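
    A minimal sketch of proportional stratified sampling using pandas, mirroring the three-city example above; the DataFrame and its column names are assumptions made for illustration:

    ```python
    import pandas as pd

    # Toy population with cities in a 1:2:3 ratio, standing in for x, y, z.
    df = pd.DataFrame({
        "city": ["x"] * 100 + ["y"] * 200 + ["z"] * 300,
        "voter_id": range(600),
    })

    # Sample 10% from each stratum so the 1:2:3 ratio is preserved,
    # giving 10, 20 and 30 samples from x, y and z respectively.
    stratified = df.groupby("city").sample(frac=0.1, random_state=42)
    print(stratified["city"].value_counts())
    ```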

    • Reservoir sampling – a randomized algorithm used to select k out of n samples, where n is very large or unknown. For instance, reservoir sampling can be used to pick k fish from a lake without knowing how many fish it holds (see the sketch after this list). 
    • Cluster sampling – samples are taken as subgroups/clusters of the population, and these subgroups are selected at random. 
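
    A minimal sketch of the classic reservoir sampling procedure (Algorithm R), written for illustration:

    ```python
    import random

    def reservoir_sample(stream, k):
        """Select k items uniformly at random from a stream of unknown length."""
        reservoir = []
        for i, item in enumerate(stream):
            if i < k:
                reservoir.append(item)   # fill the reservoir first
            else:
                j = random.randint(0, i) # random slot in 0..i
                if j < k:
                    reservoir[j] = item  # keep the new item with probability k/(i+1)
        return reservoir

    # e.g. pick 5 "fish" from a very large stream without knowing its size upfront
    print(reservoir_sample(range(1_000_000), k=5))
    ```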


    Non-probability Sampling

    In a non-probability sampling method, each instance of a population does not have an equal chance of being selected. There is a risk of ending up with a non-representative sample that does not produce generalizable results. 

    • Convenience sampling – this technique includes people or samples that are easy to reach. Though it is the easiest way to collect a sample, it runs a high risk of not being representative of the population. 

    For instance, consider a population of size 20 (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20). 

    The surveyor picks persons 4, 7, 11 and 18 because they are easy to reach, which can create selection bias. 

    • Quota sampling – the sample is chosen based on traits or characteristics that match those of the population. 

    For instance, consider a population of size 20 (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20). 

    Consider a quota of multiples of 4 – (4,8,12,16,20). 

    • Judgement sampling – also known as selective sampling; the researcher uses their own judgement to decide which individuals to ask to participate.  
    • Snowball sampling – an individual element/person nominates further elements/people known to them. It is applicable when the sampling frame is difficult to identify. 

     A nominates P, P nominates G, G nominates M 

    A > P > G > M 

    The non-probability sampling technique may lead to selection bias and population misrepresentation.  

    Oversampling v/s Undersampling

    • We often come across imbalanced datasets, in which one class heavily outnumbers the other.  
    • Resampling is a technique used to deal with imbalanced datasets. 
    • It includes removing samples from the majority class, i.e., undersampling.  
    • It also includes adding more instances of the minority class, i.e., oversampling.  

    Python has a dedicated library for tackling imbalanced datasets, known as imblearn. Imblearn offers multiple methods for undersampling and oversampling, as sketched below. Enroll in KnowledgeHut Data courses online to advance your career in Data Science.     
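
    For instance, a minimal sketch of random under- and oversampling with imblearn, using a synthetic dataset purely for illustration:

    ```python
    from collections import Counter

    from sklearn.datasets import make_classification
    from imblearn.over_sampling import RandomOverSampler
    from imblearn.under_sampling import RandomUnderSampler

    # A synthetic, roughly 90/10 imbalanced binary dataset.
    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
    print("original:", Counter(y))

    # Oversampling duplicates minority instances; undersampling drops majority ones.
    X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X, y)
    X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)
    print("oversampled:", Counter(y_over))
    print("undersampled:", Counter(y_under))
    ```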

    Tomek Links


    • Tomek links for undersampling – pairs of instances from opposite classes that are each other's nearest neighbours. 
    • The majority-class member of each Tomek link is eliminated, which intuitively gives the ML classifier a cleaner decision boundary (see the sketch after this list).  
    • SMOTE for oversampling – the Synthetic Minority Oversampling Technique works by generating new examples from existing minority-class instances. It is a statistical technique for increasing the number of instances in the dataset in a more balanced manner.  
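
    A minimal sketch of Tomek-link undersampling with imblearn's TomekLinks class, again on a synthetic dataset; SMOTE is covered in the next section:

    ```python
    from sklearn.datasets import make_classification
    from imblearn.under_sampling import TomekLinks

    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

    # Remove majority-class members of Tomek links near the class boundary.
    X_clean, y_clean = TomekLinks().fit_resample(X, y)
    print(len(y), "->", len(y_clean))
    ```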

    Synthetic Samples


    • Pick a minority-class instance as the input vector.  
    • Find its k nearest neighbours (k_neighbors is passed as an argument to SMOTE()). 
    • Pick one of those neighbours and place a synthetic point anywhere on the line joining the input point and the chosen neighbour.  
    • Repeat the above steps until the classes are balanced, as in the sketch after this list. 
    • Other must-read sampling methods: NearMiss and cluster centroids for undersampling, ADASYN and borderline SMOTE (bSMOTE) for oversampling.  
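
    A minimal sketch of SMOTE with imblearn, passing k_neighbors explicitly as described above (synthetic data for illustration):

    ```python
    from collections import Counter

    from sklearn.datasets import make_classification
    from imblearn.over_sampling import SMOTE

    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

    # Synthesize new minority points along lines to their k nearest neighbours.
    X_res, y_res = SMOTE(k_neighbors=5, random_state=42).fit_resample(X, y)
    print(Counter(y), "->", Counter(y_res))
    ```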

    Train-Test split  

    Python ships with powerful ML libraries. The train_test_split() function from the Scikit-Learn library is one of the main Python utilities for randomly splitting a dataset into training and test subsets. The parameter train_size takes a fraction between zero and one to specify the training size; the remaining samples in the original dataset are used for testing. The records that go into the training and test sets are sampled at random. (Some tutorials define their own helper named split_train_test(); the behaviour is essentially the same.) 

    • train set – the subset of the dataset used to train the model 
    • test set – the subset of the dataset used to test the trained model 
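
    A minimal sketch of the split itself, using scikit-learn's built-in iris dataset purely for illustration:

    ```python
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)

    # train_size=0.8 keeps 80% of the rows for training; the rest are for testing.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=0.8, random_state=42
    )
    ```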

    Training Set v/s Test Set

    • The train-test split is used to measure the performance of ML algorithms.  
    • It is appropriate to use this procedure when the dataset is very large. 
    • A train-test split can be implemented for any supervised machine learning algorithm.  
    • It involves taking the dataset as a whole and subdividing it into two subsets. 
    • The training dataset is used to fit the model.  
    • The test dataset serves as unseen input to the model. 
    • The model's predictions are made on the test data.  
    • The output (prediction) is compared to the expected values.  
    • The ultimate objective is to evaluate the performance of the ML model on new or unseen data. 


    It is important that the test data adheres to the following conditions:   

    1. It is large enough to yield statistically significant results. 

    2. It is representative of the whole dataset. One must not pick a test set with traits/characteristics different from those of the training set. 

    3. Never train on test data – don't be fooled by good results and high accuracy; it might be the case that the model was accidentally trained on the test data. 

    The train_test_split() function comes with additional features: 

    • a random seed generator via the random_state parameter – this makes the split reproducible, so the same samples go to the training and test sets on every run 
    • it takes multiple datasets with a matching number of rows and splits them on the same indices 
    • train_test_split returns four variables:  
      • train_X – the X features of the training set 
      • train_y – the values of the response variable for the training set 
      • test_X – the X features of the test set 
      • test_y – the values of the response variable for the test set 


    • There is no exact rule for splitting the data 80:20 or 70:30; it depends on the data and the target variable. Some data scientists use a range of 60% to 80% for training and the rest for testing. 
    • To find the number of records, use Python's len() function: len(X_train), len(X_test) – see the sketch after this list. 
    • The model is built using the training set and is tested using the test set. 
    • X_train and y_train contain the independent features and the response-variable values for the training dataset respectively. 
    • X_test and y_test contain the independent features and the response-variable values for the test dataset respectively. 
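
    Putting the pieces together, a minimal end-to-end sketch; the logistic regression model is our choice for illustration only:

    ```python
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)

    # 70:30 split; stratify=y keeps the class ratios similar in both subsets.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=0.7, stratify=y, random_state=42
    )
    print(len(X_train), len(X_test))   # 105 and 45 of the 150 rows

    # Fit on the training set, evaluate on the unseen test set.
    model = LogisticRegression(max_iter=200).fit(X_train, y_train)
    print("test accuracy:", model.score(X_test, y_test))
    ```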

    Conclusion:

    Sampling is the process of accumulating observations in order to estimate a population parameter. We learnt about the two families of sampling – the probability sampling procedure and the non-probability sampling procedure. Resampling is a repeated process of drawing samples from the main data source. Finally, we learnt about splitting the data into training and test sets, which is used to measure the performance of a model: the model is trained and tested to uncover data discrepancies and to develop a better understanding of the machine learning model. 


    Dipayan Ghatak

    Project Manager

    Leading Projects across geographies in Microsoft Consultant Services.
