What Is Data Splitting in Learn and Test Data?

Published 05th Sep, 2023

    Data is the fuel of every machine learning algorithm: it is the basis on which statistical inferences are made and predictions are produced. Consequently, it is important to collect the data, clean it, and use it with maximum efficacy. Good sampling yields accurate predictions and drives the whole ML project forward, whereas poor sampling leads to incorrect predictions. Before diving into the sampling techniques, let us understand what a population is and how it differs from a sample. 

    The population is the entire collection of elements that share one or more characteristics of interest. The total number of observations is the size of the population. For more information, check out Data courses online.  


    A sample is a subset of the population. The process of choosing a sample from a given population is known as sampling, and the number of elements in the sample is the sample size. 

    Data sampling refers to statistical approaches for picking observations from the domain in order to estimate a population parameter, whereas data resampling refers to drawing repeated samples from the original source of data. Resampling is a non-parametric procedure of statistical inference: it produces new sample distributions based on the original data and is used to improve accuracy and to measure the uncertainty of a population parameter. 

    Sampling methods can be divided into two parts: 

    1. Probability sampling procedure  
    2. Non-probability sampling procedure  

    The distinction between the two is whether the selection of the sample depends on randomization. With randomization, every element of the population gets an equal chance of being selected for the study. 

    Probability Sampling

    It is a method in which each element of a given population has an equal chance of being selected. 

    • Simple random sampling – every element has an equal chance of being selected. For instance, in a classroom of 100 students, each student has an equal chance of being chosen as the class representative. 
    • Systematic sampling – the first element is selected at random, and the others are selected at a fixed sampling interval. 

    For instance, consider a population of size 20 (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20). 

    Suppose the first selected element is number 3 and the sample size is 5. The sampling interval is 20/5 = 4, so the next selection is 3 + 4 = 7, giving 3, 7, 11, 15, 19. 
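
    The same two techniques in a minimal Python sketch (the variable names are ours, chosen for illustration only):

    ```python
    import random

    population = list(range(1, 21))   # the population 1..20 from the example
    sample_size = 5

    # Simple random sampling: every element has an equal chance of selection.
    simple_random = random.sample(population, k=sample_size)

    # Systematic sampling: pick a random start, then step by a fixed
    # interval of population size / sample size = 20 / 5 = 4.
    interval = len(population) // sample_size
    start = random.randrange(interval)        # index 0..3
    systematic = population[start::interval]  # e.g. 3, 7, 11, 15, 19

    print("simple random:", simple_random)
    print("systematic:   ", systematic)
    ```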

    • Stratified sampling – the total group is subdivided into smaller groups, known as strata, and samples are drawn from each stratum. 

    Assume that we need to identify the average number of votes in three different cities to elect a representative. City x has 1 million citizens, city y has 2 million, and city z has 3 million. We could randomly choose a sample of size 60 from the entire population, but such a random sample would not be balanced across the cities, which could introduce estimation error. To overcome this, we instead choose random samples of 10, 20 and 30 from cities x, y and z respectively, minimizing the total estimation error. 
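
    A minimal sketch of proportional stratified sampling using pandas, mirroring the three-city example above; the DataFrame and its column names are assumptions made for illustration:

    ```python
    import pandas as pd

    # Toy population with cities in a 1:2:3 ratio, standing in for x, y, z.
    df = pd.DataFrame({
        "city": ["x"] * 100 + ["y"] * 200 + ["z"] * 300,
        "voter_id": range(600),
    })

    # Sample 10% from each stratum so the 1:2:3 ratio is preserved,
    # giving 10, 20 and 30 samples from x, y and z respectively.
    stratified = df.groupby("city").sample(frac=0.1, random_state=42)
    print(stratified["city"].value_counts())
    ```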

    • Reservoir sampling – a randomized algorithm used to select k out of n samples, where n is very large or unknown. For instance, reservoir sampling can be used to pick k fish from a lake without knowing how many fish it holds (see the sketch after this list). 
    • Cluster sampling – samples are taken as subgroups/clusters of the population, and these subgroups are selected at random. 
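
    A minimal sketch of the classic reservoir sampling procedure (Algorithm R), written for illustration:

    ```python
    import random

    def reservoir_sample(stream, k):
        """Select k items uniformly at random from a stream of unknown length."""
        reservoir = []
        for i, item in enumerate(stream):
            if i < k:
                reservoir.append(item)   # fill the reservoir first
            else:
                j = random.randint(0, i) # random slot in 0..i
                if j < k:
                    reservoir[j] = item  # keep the new item with probability k/(i+1)
        return reservoir

    # e.g. pick 5 "fish" from a very large stream without knowing its size upfront
    print(reservoir_sample(range(1_000_000), k=5))
    ```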


    Non-probability Sampling

    In a non-probability sampling method, each instance of a population does not have an equal chance of being selected. There is a risk of ending up with a non-representative sample that does not produce generalizable results. 

    • Convenience sampling – this technique includes people or samples that are easy to reach. Though it is the easiest way to collect a sample, it runs a high risk of not being representative of the population. 

    For instance, consider a population of size 20 (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20). 

    The surveyor picks persons 4, 7, 11 and 18 because they are easy to reach, which can create selection bias. 

    • Quota sampling – the sample is chosen based on traits or characteristics that match those of the population. 

    For instance, consider a population of size 20 (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20). 

    Consider a quota of multiples of 4 – (4,8,12,16,20). 

    • Judgement sampling – also known as selective sampling; the researcher uses their own judgement to decide which individuals to ask to participate.  
    • Snowball sampling – an individual element/person nominates further elements/people known to them. It is applicable when the sampling frame is difficult to identify. 

     A nominates P, P nominates G, G nominates M 

    A > P > G > M 

    The non-probability sampling technique may lead to selection bias and population misrepresentation.  

    Oversampling v/s Undersampling

    • We often come across imbalanced datasets, in which one class heavily outnumbers the other.  
    • Resampling is a technique used to deal with imbalanced datasets. 
    • It includes removing samples from the majority class, i.e., undersampling.  
    • It also includes adding more instances of the minority class, i.e., oversampling.  

    Python has a dedicated library for tackling imbalanced datasets, known as imblearn. Imblearn offers multiple methods for undersampling and oversampling, as sketched below. Enroll in KnowledgeHut Data courses online to advance your career in Data Science.     
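
    For instance, a minimal sketch of random under- and oversampling with imblearn, using a synthetic dataset purely for illustration:

    ```python
    from collections import Counter

    from sklearn.datasets import make_classification
    from imblearn.over_sampling import RandomOverSampler
    from imblearn.under_sampling import RandomUnderSampler

    # A synthetic, roughly 90/10 imbalanced binary dataset.
    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
    print("original:", Counter(y))

    # Oversampling duplicates minority instances; undersampling drops majority ones.
    X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X, y)
    X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)
    print("oversampled:", Counter(y_over))
    print("undersampled:", Counter(y_under))
    ```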

    Tomek Links


    • Tomek links for undersampling – pairs of instances from opposite classes that are each other's nearest neighbours. 
    • The majority-class member of each Tomek link is eliminated, which intuitively gives the ML classifier a cleaner decision boundary (see the sketch after this list).  
    • SMOTE for oversampling – the Synthetic Minority Oversampling Technique works by generating new examples from existing minority-class instances. It is a statistical technique for increasing the number of instances in the dataset in a more balanced manner.  
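
    A minimal sketch of Tomek-link undersampling with imblearn's TomekLinks class, again on a synthetic dataset; SMOTE is covered in the next section:

    ```python
    from sklearn.datasets import make_classification
    from imblearn.under_sampling import TomekLinks

    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

    # Remove majority-class members of Tomek links near the class boundary.
    X_clean, y_clean = TomekLinks().fit_resample(X, y)
    print(len(y), "->", len(y_clean))
    ```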

    Synthetic Samples


    • Pick a minority-class instance as the input vector.  
    • Find its k nearest neighbours (k_neighbors is passed as an argument to SMOTE()). 
    • Pick one of those neighbours and place a synthetic point anywhere on the line joining the input point and the chosen neighbour.  
    • Repeat the above steps until the classes are balanced, as in the sketch after this list. 
    • Other must-read sampling methods: NearMiss and cluster centroids for undersampling, ADASYN and borderline SMOTE (bSMOTE) for oversampling.  
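
    A minimal sketch of SMOTE with imblearn, passing k_neighbors explicitly as described above (synthetic data for illustration):

    ```python
    from collections import Counter

    from sklearn.datasets import make_classification
    from imblearn.over_sampling import SMOTE

    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

    # Synthesize new minority points along lines to their k nearest neighbours.
    X_res, y_res = SMOTE(k_neighbors=5, random_state=42).fit_resample(X, y)
    print(Counter(y), "->", Counter(y_res))
    ```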

    Train-Test split  

    Python ships with powerful ML libraries. The train_test_split() function from the Scikit-Learn library is one of the main Python utilities for randomly splitting a dataset into training and test subsets. The parameter train_size takes a fraction between zero and one to specify the training size; the remaining samples in the original dataset are used for testing. The records that go into the training and test sets are sampled at random. (Some tutorials define their own helper named split_train_test(); the behaviour is essentially the same.) 

    • train set – the subset of the dataset used to train the model 
    • test set – the subset of the dataset used to test the trained model 
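
    A minimal sketch of the split itself, using scikit-learn's built-in iris dataset purely for illustration:

    ```python
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)

    # train_size=0.8 keeps 80% of the rows for training; the rest are for testing.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=0.8, random_state=42
    )
    ```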

    Training Set v/s Test Set

    • The train-test split is used to measure the performance of ML algorithms.  
    • It is appropriate to use this procedure when the dataset is very large. 
    • A train-test split can be implemented for any supervised machine learning algorithm.  
    • It involves taking the dataset as a whole and subdividing it into two subsets. 
    • The training dataset is used to fit the model.  
    • The test dataset serves as unseen input to the model. 
    • The model's predictions are made on the test data.  
    • The output (prediction) is compared to the expected values.  
    • The ultimate objective is to evaluate the performance of the ML model on new or unseen data. 


    It is important that the test data adheres to the following conditions:   

    1. It is large enough to yield statistically significant results. 

    2. It is representative of the whole dataset. One must not pick a test set with traits/characteristics different from those of the training set. 

    3. Never train on test data – don't be fooled by good results and high accuracy; it might be the case that the model was accidentally trained on the test data. 

    The train_test_split() function comes with additional features: 

    • a random seed generator via the random_state parameter – this makes the split reproducible, so the same samples go to the training and test sets on every run 
    • it takes multiple datasets with a matching number of rows and splits them on the same indices 
    • train_test_split returns four variables:  
      • train_X – the X features of the training set 
      • train_y – the values of the response variable for the training set 
      • test_X – the X features of the test set 
      • test_y – the values of the response variable for the test set 


    • There is no exact rule for splitting the data 80:20 or 70:30; it depends on the data and the target variable. Some data scientists use a range of 60% to 80% for training and the rest for testing. 
    • To find the number of records, use Python's len() function: len(X_train), len(X_test) – see the sketch after this list. 
    • The model is built using the training set and is tested using the test set. 
    • X_train and y_train contain the independent features and the response-variable values for the training dataset respectively. 
    • X_test and y_test contain the independent features and the response-variable values for the test dataset respectively. 
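
    Putting the pieces together, a minimal end-to-end sketch; the logistic regression model is our choice for illustration only:

    ```python
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)

    # 70:30 split; stratify=y keeps the class ratios similar in both subsets.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=0.7, stratify=y, random_state=42
    )
    print(len(X_train), len(X_test))   # 105 and 45 of the 150 rows

    # Fit on the training set, evaluate on the unseen test set.
    model = LogisticRegression(max_iter=200).fit(X_train, y_train)
    print("test accuracy:", model.score(X_test, y_test))
    ```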

    Conclusion:

    Sampling is the process of accumulating observations in order to estimate a population parameter. We learnt about the two families of sampling – the probability sampling procedure and the non-probability sampling procedure. Resampling is a repeated process of drawing samples from the main data source. Finally, we learnt about splitting the data into training and test sets, which is used to measure the performance of a model: the model is trained and tested to uncover data discrepancies and to develop a better understanding of the machine learning model. 


    Dipayan Ghatak

    Project Manager

    Leading Projects across geographies in Microsoft Consultant Services.
