Machine Learning Tutorial

The input data to a learning algorithm usually has a row-by-column structure and is often stored as a CSV file. CSV stands for comma-separated values, a simple file format for storing tabular data. A CSV file can be loaded into a pandas DataFrame with the read_csv function. It can be loaded using other libraries as well, and we will look at a few approaches in this post.

Let us now load CSV files using different methods:

Using Python standard library

Python has built-in modules, such as ‘csv’, which provides a reader function for reading the data in a CSV file. The CSV file is opened in read mode and the resulting file object is passed to the reader function. Below is an example demonstrating the same:

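import csv
import numpy as np

path = "path to csv file"  # placeholder: replace with the actual file path
with open(path, 'r', newline='') as infile:
    reader = csv.reader(infile, delimiter=',')
    headers = next(reader)  # the first row holds the column names
    data = list(reader)     # the remaining rows hold the values as strings
# Convert the list of string rows into a 2-D array of floats
data = np.array(data).astype(float)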

The headers or the column names can be printed using the following line of code:

print(headers)

The dimensions of the dataset can be determined using the shape attribute as shown in the following line of code:

print(data.shape)
Output:
(250, 302)

A quick look at the first few rows gives a sense of what the data contains. The following line of code shows the first two rows:

data[:2]
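
The steps in this section can also be tried end to end without a file on disk. The following is a minimal sketch that assumes a small in-memory CSV; the contents of `sample` and its column names are made up for illustration and are not the dataset used in this post.

```python
import csv
import io

import numpy as np

# Hypothetical in-memory CSV standing in for a real file on disk.
sample = io.StringIO("a,b,c\n1,2,3\n4,5,6\n")

reader = csv.reader(sample, delimiter=",")
headers = next(reader)   # first row: column names
rows = list(reader)      # remaining rows: values as strings

# Convert the string values to floats in a 2-D array.
data = np.array(rows).astype(float)

print(headers)      # ['a', 'b', 'c']
print(data.shape)   # (2, 3)
print(data[:2])     # first two rows
```

Note that the conversion to float only works when every remaining cell is numeric; a text column would need to be handled separately.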

Using numpy package

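The numpy package has a function named ‘loadtxt’ that reads delimited text into an array. By default it splits on whitespace, so a comma-separated file needs the delimiter="," argument. Below is an example that uses StringIO as a stand-in for a file, with whitespace-separated values.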

from numpy import loadtxt
from io import StringIO
# StringIO wraps a string so that it can be read like a file
c = StringIO("0 1 2 \n3 4 5")
data = loadtxt(c)
print(data.shape)

Output:

(2, 3)
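
For text that is actually comma-separated and starts with a header row, loadtxt needs two extra arguments. The sketch below assumes a made-up three-column input; `delimiter` and `skiprows` are existing loadtxt parameters, while the sample values are purely illustrative.

```python
from io import StringIO

import numpy as np

# Hypothetical comma-separated text with one header row.
c = StringIO("x,y,z\n0,1,2\n3,4,5")

# delimiter="," splits on commas; skiprows=1 skips the header line,
# because loadtxt expects every remaining value to be numeric.
data = np.loadtxt(c, delimiter=",", skiprows=1)

print(data.shape)   # (2, 3)
print(data.dtype)   # float64
```

The column names are discarded here, which is one reason pandas is often more convenient when the header matters.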

Using pandas package

There are a few things to keep in mind while dealing with CSV files using the pandas package.

• The file header is the row of column names, which describes the type of data each column holds. If the file has a header, read_csv uses those names for the columns; otherwise the columns need to be named manually.
• By default, read_csv treats the first row as the header. If the CSV file does not contain a header, this must be stated explicitly with header=None, and column names can be supplied through the names parameter.
• CSV has no standard comment syntax, but lines starting with the # symbol are commonly used as comments. Passing comment="#" to read_csv makes it skip them.
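
The points above can be sketched with a small in-memory example. It assumes a file with no header row and one comment line; the contents of `raw` and the column names are hypothetical, while `header`, `names` and `comment` are existing read_csv parameters.

```python
from io import StringIO

import pandas as pd

# Hypothetical file contents: no header row, one comment line.
raw = "# measurements collected by hand\n1.0,0.5,2\n0.0,1.5,3\n"

df = pd.read_csv(
    StringIO(raw),
    header=None,                    # the file has no header row
    names=["target", "f0", "f1"],   # hypothetical column names
    comment="#",                    # ignore everything after '#'
)

print(df.shape)           # (2, 3)
print(list(df.columns))   # ['target', 'f0', 'f1']
```

Without header=None, the first data row would be consumed as column names and the DataFrame would end up one row short.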

Let us look at an example to understand how the CSV file is read as a dataframe.

import pandas as pd
# Obtain the dataset
df = pd.read_csv("path to csv file", sep=",")
# Display the first five rows
df[:5]

Output:

   target      0      1      2  ...    295    296    297    298    299
0     1.0 -0.098  2.165  0.681  ... -2.097  1.051 -0.414  1.038 -1.065
1     0.0  1.081 -0.973 -0.383  ... -1.624 -0.458 -1.099 -0.936  0.973
2     1.0 -0.523 -0.089 -0.348  ... -1.165 -1.544  0.004  0.800 -1.211
3     1.0  0.067 -0.021  0.392  ...  0.467 -0.562 -0.254 -0.533  0.238
4     1.0  2.347 -0.831  0.511  ...  1.378  1.246  1.478  0.428  0.253

[5 rows x 302 columns]

Conclusion

In this post, we saw how input data can be loaded for machine learning projects using the Python standard library, numpy, and pandas.

Pavan Kumar Reddy B
