Machine Learning Tutorial

By KnowledgeHut

The input data to a learning algorithm usually has a row x column structure and is often stored as a CSV file. CSV stands for comma-separated values, a simple file format for storing tabular data. A CSV file can be loaded into a Pandas dataframe with the read_csv function. It can also be loaded using other libraries, and we will look at a few approaches in this post.

Data Loading for ML Projects

Let us now load CSV files using different methods:

Using Python standard library

Python has a built-in module, ‘csv’, that provides a reader function for reading the data in a CSV file. Open the CSV file in read mode and pass the file object to the reader function. Below is an example demonstrating the same:

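import csv
import numpy as np

path = "path to csv file"  # replace with the location of your CSV file
with open(path, "r", newline="") as infile:
    reader = csv.reader(infile, delimiter=",")
    headers = next(reader)  # the first row holds the column names
    data = list(reader)     # the remaining rows, as lists of strings

# Convert the strings to floats (assumes every column is numeric)
data = np.array(data).astype(float)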

The headers or the column names can be printed using the following line of code:

print(headers)

The dimensions of the dataset can be determined using the shape attribute as shown in the following line of code:

print(data.shape)

Output:

(250, 302)

The nature of the data can be determined by examining the first few rows of the dataset using the following line of code:

data[:2]
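
The steps above can be tried without a file on disk. The following is a minimal sketch, assuming a small in-memory sample (the column names a, b, c and the values are made up for illustration) in place of a real CSV file:

```python
import csv
from io import StringIO

import numpy as np

# Hypothetical sample data standing in for a real CSV file on disk.
sample = StringIO("a,b,c\n1.0,2.0,3.0\n4.0,5.0,6.0\n")

reader = csv.reader(sample, delimiter=",")
headers = next(reader)               # first row: the column names
rows = list(reader)                  # remaining rows: lists of strings
data = np.array(rows).astype(float)  # assumes every column is numeric

print(headers)     # ['a', 'b', 'c']
print(data.shape)  # (2, 3)
print(data[:2])    # the first two rows
```

Because csv.reader returns every field as a string, the conversion to float is what turns the rows into numeric data, and it fails if any column contains non-numeric text.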

Using numpy package

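The numpy package has a function named ‘loadtxt’ that can be used to read CSV data. By default it splits each line on whitespace, so a comma-separated file needs the delimiter="," argument. Below is an example that reads whitespace-separated values from a StringIO object.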

from numpy import loadtxt 
from io import StringIO 
c = StringIO("0 1 2 \n3 4 5") 
data = loadtxt(c) 
print(data.shape)

Output:

(2, 3)
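
For data that is actually comma separated, a possible variation is sketched below. It assumes a made-up sample with a header row (x, y, z) and uses the delimiter and skiprows parameters of loadtxt:

```python
from io import StringIO

from numpy import loadtxt

# Hypothetical comma-separated sample with a header row.
c = StringIO("x,y,z\n0,1,2\n3,4,5")

# delimiter="," splits on commas; skiprows=1 skips the header line,
# which loadtxt could not convert to numbers.
data = loadtxt(c, delimiter=",", skiprows=1)
print(data.shape)  # (2, 3)
```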

Using pandas package

There are a few things to keep in mind while dealing with CSV files using the Pandas package.

The file header is the row of column names, which describe the type of data each column holds. By default, the read_csv function treats the first row of the file as the header and uses it to name the columns.
If the CSV file does not contain a header, this needs to be stated explicitly in the read_csv function (header=None), and the column names can be supplied through the names parameter.
Comments in a CSV file are commonly written using the # symbol. The read_csv function skips them only when the comment parameter is set (comment="#").

Let us look at an example to understand how the CSV file is read as a dataframe.

import pandas as pd

# Obtain the dataset
df = pd.read_csv("path to csv file", sep=",")
df[:5]

Output:

   target       0       1       2  ...     295     296     297     298     299
0     1.0  -0.098   2.165   0.681  ...  -2.097   1.051  -0.414   1.038  -1.065
1     0.0   1.081  -0.973  -0.383  ...  -1.624  -0.458  -1.099  -0.936   0.973
2     1.0  -0.523  -0.089  -0.348  ...  -1.165  -1.544   0.004   0.800  -1.211
3     1.0   0.067  -0.021   0.392  ...   0.467  -0.562  -0.254  -0.533   0.238
4     1.0   2.347  -0.831   0.511  ...   1.378   1.246   1.478   0.428   0.253

[5 rows x 302 columns]
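
The points about headers and comments can be illustrated with a short sketch. It assumes two small in-memory samples (the column names target, f0 and f1 are hypothetical) rather than the dataset shown above:

```python
from io import StringIO

import pandas as pd

# Hypothetical file with a comment line followed by a header row.
with_header = StringIO(
    "# generated sample\n"
    "target,f0,f1\n"
    "1.0,-0.098,2.165\n"
    "0.0,1.081,-0.973\n"
)
# comment="#" makes read_csv ignore the comment line, so the first
# remaining row is used as the header.
df = pd.read_csv(with_header, comment="#")
print(df.columns.tolist())

# Hypothetical file without a header row: state this with header=None
# and supply the column names manually.
no_header = StringIO("1.0,-0.098,2.165\n0.0,1.081,-0.973\n")
df2 = pd.read_csv(no_header, header=None, names=["target", "f0", "f1"])
print(df2.shape)
```

Without header=None, the first data row of the second sample would be consumed as column names, and the dataframe would silently lose one row.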

Conclusion

In this post, we saw how input data stored in CSV files can be loaded for machine learning projects using the Python standard library, numpy, and Pandas.
