The input data to a learning algorithm usually has a row-by-column structure and is often stored as a CSV file. CSV stands for comma-separated values, a simple file format for storing tabular data. A CSV file can be loaded into a Pandas dataframe with the read_csv function. It can also be loaded using other libraries, and we will look at a few approaches in this post.
Let us now load CSV files using different methods:
Python has a built-in module, ‘csv’, that contains a reader function, which can be used to read the data in a CSV file. The file is opened in read mode and passed to the reader function. Below is an example demonstrating the same:
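import numpy as np
import csv

path = "path to csv file"
with open(path, 'r') as infile:
    reader = csv.reader(infile, delimiter=',')
    # The first row holds the column names
    headers = next(reader)
    # The remaining rows hold the data
    data = list(reader)
# Convert the rows of strings to a numeric array
data = np.array(data).astype(float)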
The headers, or column names, can be printed with print(headers).
The dimensions of the dataset can be determined using the shape attribute, for example print(data.shape).
The nature of the data can be determined by examining the first few rows of the dataset, for example with data[:3].
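As a minimal sketch of the steps above, the following self-contained example reads a small in-memory CSV instead of a file on disk. The sample values and column names are made up for illustration and are not part of the original dataset.

```python
import csv
from io import StringIO

import numpy as np

# Hypothetical sample data standing in for a CSV file on disk.
sample = StringIO("a,b,c\n1,2,3\n4,5,6\n7,8,9\n")

reader = csv.reader(sample, delimiter=",")
headers = next(reader)                       # first row holds the column names
data = np.array(list(reader)).astype(float)  # remaining rows as a float array

print(headers)     # column names
print(data.shape)  # (rows, columns)
print(data[:2])    # first two rows
```

Note that the conversion to float only works when every remaining cell is numeric; a text column would raise an error at that step.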
Using numpy package
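The numpy package has a function named ‘loadtxt’ that can be used to read delimited text data, including CSV. By default it splits on whitespace; for comma-separated data, pass delimiter=','. Below is an example that reads whitespace-separated values from a StringIO object.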
from numpy import loadtxt
from io import StringIO

c = StringIO("0 1 2 \n3 4 5")
data = loadtxt(c)
print(data.shape)
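For data that is actually comma-separated, a sketch along the following lines may help. It assumes a small made-up dataset with a header row; the delimiter and skiprows arguments of loadtxt handle the commas and the header respectively.

```python
from io import StringIO

import numpy as np

# Hypothetical comma-separated data with a header row.
csv_text = StringIO("x,y,z\n0,1,2\n3,4,5\n")

# delimiter="," switches loadtxt from whitespace to commas;
# skiprows=1 skips the header line, which cannot be parsed as float.
values = np.loadtxt(csv_text, delimiter=",", skiprows=1)

print(values.shape)
```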
Using pandas package
There are a few things to keep in mind while dealing with CSV files using the Pandas package.
Let us look at an example to understand how the CSV file is read as a dataframe.
import numpy as np
import pandas as pd

# Obtain the dataset
df = pd.read_csv("path to csv file", sep=",")
df[:5]
0 1.0 -0.098 2.165 0.681 ... -2.097 1.051 -0.414 1.038 -1.065
1 0.0 1.081 -0.973 -0.383 ... -1.624 -0.458 -1.099 -0.936 0.973
2 1.0 -0.523 -0.089 -0.348 ... -1.165 -1.544 0.004 0.800 -1.211
3 1.0 0.067 -0.021 0.392 ... 0.467 -0.562 -0.254 -0.533 0.238
4 1.0 2.347 -0.831 0.511 ... 1.378 1.246 1.478 0.428 0.253
[5 rows x 302 columns]
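The same call can be tried without a file on disk. The sketch below assumes a small made-up dataset held in a StringIO buffer (the column names label, f1 and f2 are hypothetical), reads it with read_csv, and inspects the resulting dataframe.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample data; in practice this would be a file path.
csv_text = StringIO("label,f1,f2\n1.0,-0.098,2.165\n0.0,1.081,-0.973\n")

frame = pd.read_csv(csv_text, sep=",")

print(frame.head())         # first rows (up to five)
print(frame.shape)          # (rows, columns)
print(list(frame.columns))  # column names taken from the header row
```

Unlike the csv module approach, read_csv takes the column names from the first row by default and infers a type for each column.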
In this post, we saw how input data can be loaded for machine learning projects.