What Is Data Splitting in Train and Test Data?
Data is the fuel of every machine learning algorithm: it is what statistical inferences and predictions are built on. Consequently, it is important to collect data, clean it, and use it effectively. Good sampling yields accurate predictions and drives an ML project forward, whereas poor sampling leads to incorrect predictions. Before diving into the sampling techniques, let us understand what a population is and how it differs from a sample.
A population is the complete collection of elements that share one or more characteristics of interest. The total number of observations is the size of the population.
A sample is a subset of the population. The process of choosing a sample from a given population is known as sampling, and the number of elements in the sample is the sample size.
Data sampling refers to statistical approaches for selecting observations from a domain in order to estimate a population parameter. Data resampling, by contrast, refers to drawing repeated samples from the original data. It is a non-parametric procedure of statistical inference: it produces new sample distributions based on the original data and is used to improve accuracy and to quantify the uncertainty of a population parameter.
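The resampling idea can be sketched with the bootstrap: draw many samples with replacement from the original data and look at how an estimate (here, the mean) varies across them. This is a minimal standard-library illustration with made-up numbers, not output from any real study:

```python
import random
import statistics

random.seed(0)

# An original sample drawn from some population (illustrative values only).
data = [12, 15, 9, 14, 11, 17, 10, 13, 16, 8]

# Bootstrap resampling: repeatedly draw a sample WITH replacement of the
# same size as the original data, and record the mean of each resample.
boot_means = []
for _ in range(1000):
    resample = random.choices(data, k=len(data))
    boot_means.append(statistics.mean(resample))

# The spread of the bootstrap means estimates the uncertainty of the
# sample mean without any parametric assumption about the population.
print(round(statistics.mean(boot_means), 2))
print(round(statistics.stdev(boot_means), 2))
```

The spread (standard deviation) of `boot_means` is a direct, assumption-free estimate of how uncertain the original sample mean is.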
Sampling methods can be divided into two categories. The distinction between them is whether selection relies on randomization: with randomization, every element of the population has an equal chance of being chosen for the sample.
Probability sampling – A method in which each element of a given population has an equal chance of being selected.
Systematic sampling – Elements are selected at a regular interval, starting from a chosen element. For instance, consider a population of size 20 (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20). Suppose the starting element is 3 and the sample size is 5. The selection interval is 20/5 = 4, so the next selection is 3 + 4 = 7, and the sample is 3, 7, 11, 15 and 19.
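The interval walk above can be sketched in a few lines. This is a minimal illustration of the 20/5 example, assuming we start at element 3:

```python
# Systematic sampling sketch: pick every k-th element after a starting point.
population = list(range(1, 21))       # population of size 20: 1..20
sample_size = 5
k = len(population) // sample_size    # selection interval: 20 / 5 = 4
start = 3                             # starting element, as in the example

# Step through the population at intervals of k, beginning at `start`.
sample = [population[(start - 1 + i * k) % len(population)]
          for i in range(sample_size)]
print(sample)  # -> [3, 7, 11, 15, 19]
```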
Stratified sampling – The population is subdivided into smaller groups, known as strata, and a sample is drawn from each stratum.
Assume we need to estimate the average number of votes across three cities to elect a representative. City x has 1 million citizens, city y has 2 million and city z has 3 million. We could randomly choose a sample of size 60 from the entire population, but such a sample may not be balanced across the cities, which introduces estimation error. To overcome this, we instead draw random samples of 10, 20 and 30 from cities x, y and z respectively, in proportion to their populations, and thereby minimize the total estimation error.
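The proportional allocation used in the three-city example can be computed directly: each stratum's share of the total sample equals its share of the population. A minimal sketch of that arithmetic:

```python
# Proportional allocation for stratified sampling: each city's share of the
# total sample of 60 is proportional to its share of the population.
populations = {"x": 1_000_000, "y": 2_000_000, "z": 3_000_000}
total_sample = 60

total_pop = sum(populations.values())
allocation = {city: round(total_sample * size / total_pop)
              for city, size in populations.items()}
print(allocation)  # -> {'x': 10, 'y': 20, 'z': 30}
```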
Non-probability sampling – In a non-probability sampling method, each element of the population does not have an equal chance of being selected. There is a risk of ending up with a non-representative sample that does not produce a generalizable outcome.
Convenience sampling – This technique includes people or samples that are easy to reach. Though it is the easiest way to collect a sample, it runs a high risk of not being representative of the population.
For instance, consider a population of size 20 (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20). The surveyor invites only persons 4, 7, 11 and 18 to participate, which can create selection bias.
Quota sampling – Samples are chosen based on traits or characteristics that match those of the population, until a preset quota is filled.
For instance, consider a population of size 20 (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20), and a quota of the multiples of 4: (4, 8, 12, 16, 20).
Snowball sampling – Existing participants recruit further participants from among their acquaintances. For instance, A nominates P, P nominates G, and G nominates M, giving the chain A > P > G > M.
The non-probability sampling technique may lead to selection bias and population misrepresentation.
There is a dedicated Python library for tackling imbalanced datasets, known as imblearn. Imblearn provides multiple methods for undersampling and oversampling.
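The core idea behind random oversampling (what imblearn's RandomOverSampler does) can be sketched with the standard library alone: duplicate rows of the minority class, chosen with replacement, until the classes are balanced. The toy dataset below is invented for illustration:

```python
import random
from collections import Counter

random.seed(42)

# An imbalanced toy dataset of (feature, label) rows: label 1 is the minority.
data = [(i, 0) for i in range(90)] + [(i, 1) for i in range(10)]

counts = Counter(y for _, y in data)
majority = max(counts.values())

# Random oversampling: duplicate minority-class rows, drawn with
# replacement, until every class has as many rows as the majority class.
balanced = list(data)
for cls, n in counts.items():
    if n < majority:
        minority_rows = [row for row in data if row[1] == cls]
        balanced.extend(random.choices(minority_rows, k=majority - n))

print(Counter(y for _, y in balanced))  # -> Counter({0: 90, 1: 90})
```

In practice imblearn's fit_resample interface does this (and smarter variants such as SMOTE) directly on feature matrices.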
Python's Scikit-Learn library ships with a powerful set of ML utilities. Its train_test_split() function, from the sklearn.model_selection module, splits a dataset randomly into training and test subsets. The parameter train_size takes a fraction between zero and one specifying the proportion of the data used for training; the remaining samples in the original dataset are held out for testing. The records assigned to the training and test sets are randomly sampled.
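A minimal sketch of the split, assuming scikit-learn is installed and using toy data:

```python
from sklearn.model_selection import train_test_split

X = list(range(100))             # features (toy data)
y = [i % 2 for i in range(100)]  # labels (toy data)

# train_size=0.75 keeps 75% of the rows for training; the remaining 25%
# become the test set. random_state makes the random shuffle reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=42)

print(len(X_train), len(X_test))  # -> 75 25
```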
It is important that the test data adheres to the following conditions:
It is large enough to yield statistically significant results.
It is representative of the whole dataset; one must not pick a test set whose traits or characteristics differ from those of the training set.
Never train on test data: do not be fooled by good results and high accuracy, which may mean the model was accidentally trained on the test data.
train_test_split() also accepts additional parameters, such as test_size, random_state (for a reproducible shuffle), shuffle and stratify.
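The stratify parameter connects back to stratified sampling: it keeps the class proportions equal in both splits. A minimal sketch on invented, imbalanced labels:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

X = list(range(100))
y = [0] * 80 + [1] * 20   # imbalanced labels: 80% class 0, 20% class 1

# stratify=y forces both subsets to preserve the original 80/20 ratio,
# so neither split under-represents the minority class.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

print(Counter(y_train))  # -> Counter({0: 60, 1: 15})
print(Counter(y_test))   # -> Counter({0: 20, 1: 5})
```

Without stratify, a plain random split of a small imbalanced dataset can easily leave the test set with too few (or zero) minority-class examples.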
Conclusion:
Sampling is the process of accumulating observations in order to estimate a population variable. We learnt about the two families of sampling: probability and non-probability procedures. Resampling repeatedly draws samples from the original data source. Finally, we learnt about splitting data into training and test sets, which are used to measure the performance of a model; training and testing help uncover data discrepancies and build a better understanding of the machine learning model.