Machine Learning Tutorial


Introduction to Data in Machine Learning

What is data? 

Data is the raw, unprocessed facts that can be collected from various sources. It is generated continuously, and most of it is unstructured, meaning it does not follow a specific format. This is one reason machine learning algorithms can give poor results even when a large amount of data is fed as input: data that is not in the right format is difficult to process and consume.

What is information? 

Information is the processed form of data, i.e. data that has been cleaned and made sense of. It gives users meaningful insights about specific aspects.

Data in machine learning 

Data in machine learning often arrives as text or other non-numeric values, which need to be converted to numbers because most learning algorithms operate on numeric input. Input data to learning algorithms usually has a tabular structure made up of rows and columns. Each column is a feature, identified by its name, and each row holds one example’s value for every feature.
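To make the idea of converting text to numbers concrete, here is a minimal sketch using only the Python standard library. The table, the column names (city, rooms, price) and the helper functions build_vocabulary and encode are made up for illustration; they are not part of any particular library.

```python
# Hypothetical toy table: each dict is one row, each key is a feature (column).
rows = [
    {"city": "Paris", "rooms": 3, "price": 410},
    {"city": "Lyon", "rooms": 2, "price": 230},
    {"city": "Paris", "rooms": 1, "price": 190},
]


def build_vocabulary(rows, column):
    """Map each distinct text value in a column to an integer code."""
    values = sorted({row[column] for row in rows})
    return {value: code for code, value in enumerate(values)}


def encode(rows, column, vocabulary):
    """Return copies of the rows with the text column replaced by its code."""
    return [{**row, column: vocabulary[row[column]]} for row in rows]


city_codes = build_vocabulary(rows, "city")
encoded = encode(rows, "city", city_codes)
print(city_codes)   # {'Lyon': 0, 'Paris': 1}
print(encoded[0])   # {'city': 1, 'rooms': 3, 'price': 410}
```

Integer codes like these suggest an ordering between categories that may not exist, so in practice other schemes such as one-hot encoding are often used instead. The sketch only shows the basic step of turning a text column into a numeric one.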

The data is split into separate sets so that one part is used for training, one part for validation, and one part for testing.

  • Training data: This is the input dataset that is fed to the learning algorithm once it has been pre-processed and cleaned. Predefined datasets are sometimes readily available for download from various websites. Some of them are already cleaned, while others still need to be cleaned and verified. The learning algorithm fits a model to this data.
  • Validation data: This is similar to the test set, but it is used frequently during development to check how well the model performs on data it was not trained on. Based on the validation results, decisions can be made about how to make the algorithm learn better: the hyperparameters can be tweaked so that the model gives better results on the validation set in the next run, or features can be combined or new features created that describe the data better.
  • Test data: This is the data on which the model’s ability to generalize is judged. In the end, the model’s performance is determined by how well it handles never-before-seen data. The test set shows whether the model actually learnt the underlying patterns or merely overfit or underfit the training data.
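The three roles can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not a recommended recipe: the 70/15/15 proportions, the toy dataset, the 1-D nearest-neighbour model and the names split_dataset, knn_predict and accuracy are all chosen for the example only.

```python
import random


def split_dataset(rows, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle rows and split them into train, validation, and test lists."""
    if train_frac <= 0 or val_frac < 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave room for a test set")
    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


def knn_predict(train, x, k):
    """Majority label among the k training points closest to x (one feature)."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > len(nearest) else 0


def accuracy(train, rows, k):
    """Fraction of rows whose label the k-nearest-neighbour rule gets right."""
    correct = sum(knn_predict(train, x, k) == y for x, y in rows)
    return correct / len(rows)


# Toy data (made up): one numeric feature, label is 1 when the feature is >= 50.
data = [(x, int(x >= 50)) for x in range(100)]
train, val, test = split_dataset(data)

# Tune the hyperparameter k on the validation set only.
best_k = max([1, 3, 5, 7], key=lambda k: accuracy(train, val, k))

# Touch the test set once, after all tuning is finished.
test_accuracy = accuracy(train, test, best_k)
print(len(train), len(val), len(test), best_k, test_accuracy)
```

The point of the design is that the test rows play no part in choosing k. If they did, the final score would no longer reflect how the model handles never-before-seen data.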

It is important to understand that good-quality data (little to no noise, redundancy, or discrepancies) in large amounts tends to yield good results when the right learning algorithm is applied to it.
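As a rough illustration of what reducing redundancy and discrepancies can look like, the sketch below drops rows with missing values and rows that repeat values already seen. The function clean_rows, the column names and the sample rows are hypothetical, and real cleaning pipelines usually involve many more checks.

```python
def clean_rows(rows, required):
    """Drop rows with missing required values, then rows repeating the same values."""
    seen = set()
    cleaned = []
    for row in rows:
        if any(row.get(name) is None for name in required):
            continue                      # discrepancy: a required value is missing
        key = tuple(row[name] for name in required)
        if key in seen:
            continue                      # redundancy: same values already kept
        seen.add(key)
        cleaned.append(row)
    return cleaned


# Made-up sample rows for the example.
raw = [
    {"rooms": 3, "price": 410},
    {"rooms": 3, "price": 410},    # repeats the first row
    {"rooms": 2, "price": None},   # missing value
    {"rooms": 1, "price": 190},
]
cleaned = clean_rows(raw, required=("rooms", "price"))
print(len(raw), "->", len(cleaned))   # 4 -> 2
```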

Conclusion 

In this post, we looked at the significance of data in machine learning and at the different types of data associated with it.
