Machine Learning Tutorial

By KnowledgeHut .



Learning Model Building in Scikit-learn: A Python Machine Learning Library

Scikit-learn is a Python library used for machine learning, data analysis and data mining.

Features of scikit-learn module

It is a simple and efficient tool.
It can be used to implement various algorithms such as classification, regression and clustering.
It is open-source and can be used for production code.
It can be accessed and reused in different contexts.

Pre-requisites

NumPy
SciPy and its respective dependencies

How to install scikit-learn

pip install scikit-learn

The following are the steps in implementing a learning algorithm using scikit-learn:

Loading data

A dataset can be collected and loaded, or a pre-defined dataset can be loaded. A dataset has two components: features and responses.

Features are the attributes or variables present in the dataset. They are represented using a ‘feature matrix’, with one row per sample and one column per feature.

The response is also known as the target variable or label. It is the output, which depends on the feature variables. It is stored as a single column known as the ‘response vector’.
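As a minimal sketch of how a feature matrix and a response vector relate, the toy arrays below are hypothetical values made up for illustration; the names X and y simply follow the convention used later in this tutorial.

```python
import numpy as np

# Hypothetical toy dataset: 4 samples, 3 features each.
X = np.array([
    [5.1, 3.5, 1.4],
    [4.9, 3.0, 1.4],
    [6.2, 3.4, 5.4],
    [5.9, 3.0, 5.1],
])

# One response (label) per sample, in the same row order as X.
y = np.array([0, 0, 1, 1])

print("Feature matrix shape:", X.shape)   # (n_samples, n_features)
print("Response vector shape:", y.shape)  # (n_samples,)
```

The number of rows in X must equal the length of y, since each row of features is paired with exactly one response.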

Data can be loaded in different ways and some of them have been demonstrated below:

Using Python standard library

Python has built-in modules, such as ‘csv’, which contains a reader function that can be used to read the data in a CSV file. The CSV file is opened in read mode and passed to the reader function. Below is an example demonstrating the same:

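import csv
import numpy as np

path = "path to csv file"  # replace with the location of the CSV file
with open(path, "r", newline="") as infile:
    reader = csv.reader(infile, delimiter=",")
    headers = next(reader)  # the first row holds the column names
    data = list(reader)     # the remaining rows, as lists of strings
data = np.array(data).astype(float)  # convert the strings to a numeric array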

The headers or the column names can be printed using the following line of code:

print(headers)

The dimensions of the dataset can be determined using the shape attribute as shown in the following line of code:

print(data.shape)

Output:

(250, 302)

The nature of data can be determined by examining the first few rows of the dataset using the below line of code:

data[:2]
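The same steps can be tried without a file on disk. The sketch below assumes a small, hypothetical CSV held in memory (the column names and values are made up) and runs it through the csv reader in the same way.

```python
import csv
import io

import numpy as np

# Hypothetical in-memory CSV standing in for a file on disk.
csv_text = "id,target,f0\n0,1.0,-0.098\n1,0.0,1.081\n"

with io.StringIO(csv_text) as infile:
    reader = csv.reader(infile, delimiter=",")
    headers = next(reader)  # first row holds the column names
    rows = list(reader)     # remaining rows are lists of strings

# Convert the strings to a numeric array.
data = np.array(rows).astype(float)

print(headers)
print(data.shape)
print(data[:2])
```

Note that astype(float) only works when every remaining cell is numeric; text columns would need to be handled separately.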

Using numpy package

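The numpy package has a function named ‘loadtxt’ that can be used to read delimited text data such as CSV. By default it splits on whitespace; for comma-separated data, pass delimiter=','. Below is an example that reads whitespace-separated values from an in-memory StringIO object.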

from numpy import loadtxt 
from io import StringIO 
c = StringIO("0 1 2 \n3 4 5") 
data = loadtxt(c) 
print(data.shape)

Output:

(2, 3)
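For comma-separated data with a header row, a sketch along the following lines may be closer to a real CSV file. The text content is hypothetical; delimiter and skiprows are standard loadtxt parameters.

```python
from io import StringIO

import numpy as np

# Hypothetical comma-separated text with a header row.
c = StringIO("a,b,c\n0,1,2\n3,4,5")

# delimiter="," splits on commas; skiprows=1 skips the header line.
data = np.loadtxt(c, delimiter=",", skiprows=1)

print(data.shape)
print(data.dtype)
```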

Using pandas package

There are a few things to keep in mind while dealing with CSV files using the pandas package.

The file header is the row of column names, which describes the type of data each column holds. If the file already has a header, read_csv uses it for the column names; otherwise the columns need to be named manually through the names parameter.
By default, read_csv treats the first row as the header. If the CSV file has no header, this must be stated explicitly with header=None.
Comment lines in a CSV file are commonly marked with the # symbol. read_csv skips them only if comment='#' is passed.
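The points above can be sketched with two small, hypothetical CSV strings, one with a header and a comment line and one without a header. The column names are made up for illustration.

```python
from io import StringIO

import pandas as pd

# Hypothetical CSV text with a header row and a comment line.
with_header = "id,target\n# a comment line\n0,1.0\n1,0.0\n"
# Hypothetical CSV text without a header row.
without_header = "0,1.0\n1,0.0\n"

# header=0 marks the first row as the header; comment="#" drops comment lines.
df1 = pd.read_csv(StringIO(with_header), header=0, comment="#")

# header=None says there is no header row; names supplies the column names.
df2 = pd.read_csv(StringIO(without_header), header=None, names=["id", "target"])

print(df1)
print(df2)
```

Both calls should produce dataframes with the same two columns and the same two rows.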

Let us look at an example to understand how the CSV file is read as a dataframe.

import numpy as np 
import pandas as pd 
#Obtain the dataset 
df = pd.read_csv("path to csv file", sep=",") 
df[:5]

Output:

   id  target      0      1      2  ...    295    296    297    298    299
0   0     1.0 -0.098  2.165  0.681  ... -2.097  1.051 -0.414  1.038 -1.065
1   1     0.0  1.081 -0.973 -0.383  ... -1.624 -0.458 -1.099 -0.936  0.973
2   2     1.0 -0.523 -0.089 -0.348  ... -1.165 -1.544  0.004  0.800 -1.211
3   3     1.0  0.067 -0.021  0.392  ...  0.467 -0.562 -0.254 -0.533  0.238
4   4     1.0  2.347 -0.831  0.511  ...  1.378  1.246  1.478  0.428  0.253

Loading a pre-defined dataset

A pre-defined dataset, such as the iris dataset bundled with scikit-learn, can be loaded using the code below.

from sklearn.datasets import load_iris 
iris = load_iris() 
#feature matrix and target is stored in 2 variables 
X = iris.data 
y = iris.target 
feature_names = iris.feature_names 
target_names = iris.target_names 
#feature names and targets are printed 
print("Feature names:", feature_names) 
print("Target names:", target_names) 
#numpy arrays x and y 
print("\nType of X is:", type(X)) 
#first 5 input rows are printed to understand the type of data present in the dataset 
print("\nFirst 5 rows of X:\n", X[:5])
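When only the arrays are needed, load_iris also accepts return_X_y=True. The short sketch below uses it to confirm the sizes of the feature matrix and response vector.

```python
from sklearn.datasets import load_iris

# return_X_y=True returns the feature matrix and response vector directly.
X, y = load_iris(return_X_y=True)

print(X.shape)  # 150 samples, 4 features
print(y.shape)  # one label per sample
print(sorted(set(y.tolist())))  # three classes encoded as integers
```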

Splitting the dataset

The next important step in implementing a learning algorithm is to split the dataset into training, validation and test datasets.

Data is split into different sets so that a part of the dataset can be trained upon, a part can be validated and a part can be used for testing purposes.

Training data:

This is the input dataset which is fed to the learning algorithm once it has been pre-processed and cleaned. Predefined datasets can be downloaded from many websites; some are already cleaned, while others still need to be cleaned and verified. The learning algorithm fits a model to this data.

Validation data:

This is similar to the test set, but it is used on the model frequently, so as to know how well the model performs on never-before-seen data. Based on the results obtained on the validation set, decisions can be made about how the algorithm can be made to learn better: the hyperparameters can be tweaked so that the model gives better results on the validation set in the next run, and features can be combined or new features created which better describe the data.
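A minimal sketch of using a validation set to choose a hyperparameter is given below. It assumes the iris dataset and a kNN classifier, and the candidate values of k and the split sizes are arbitrary choices made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold out a test set first, then carve a validation set out of the rest.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0, stratify=y_rest)

# Try a few candidate values and keep the one with the best validation accuracy.
best_k, best_score = None, -1.0
for k in (1, 3, 5, 7, 9):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    score = model.score(X_val, y_val)
    if score > best_score:
        best_k, best_score = k, score

# The test set is used once, at the end, with the chosen hyperparameter.
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_rest, y_rest)
print("Chosen k:", best_k)
print("Test accuracy:", final_model.score(X_test, y_test))
```

The test set plays no part in choosing k, so the final accuracy remains an estimate on data the model has never seen.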

Test data:

This is the data on which the model’s ability to generalize is judged. In the end, the model’s performance is determined by how well it handles never-before-seen data. The test set shows whether the model actually learnt the patterns in the data or whether it overfit or underfit the training data.

It is important to understand that good-quality data (little to no noise, redundancy or discrepancies) in large amounts tends to yield good results when the right learning algorithm is applied to it.

The dataset needs to be split into training and test datasets, so that once training is completed on the training dataset, the performance of the model can be tested on the test dataset. A common choice is to use 80 percent of the data for training and 20 percent for testing. This can be achieved using the scikit-learn library, which provides a function named train_test_split. The ‘test_size’ parameter sets the fraction of the dataset assigned to the test dataset.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
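As a self-contained sketch of an 80/20 split, the example below assumes the iris dataset. The random_state and stratify arguments are optional additions, not part of the snippet above: random_state makes the shuffle repeatable, and stratify=y keeps the class proportions the same in both parts.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# test_size=0.2 keeps 20% of the samples for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```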

Feature scaling

Feature scaling is one of the most important steps in data pre-processing. It standardizes the range of the independent variables (features) in the dataset. When all features are on the same scale, no single feature dominates the model only because of its units. This can be achieved using the ‘StandardScaler’ class in the scikit-learn library, which rescales each feature to zero mean and unit variance. The scaler is fit on the training dataset and then used to transform it. The test dataset, on the other hand, is only transformed, using the statistics learned from the training dataset.
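To make the fit-then-transform order concrete, here is a small sketch on hypothetical toy data with two features on very different scales. The array values are made up; mean_ is the attribute in which StandardScaler stores the training means.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different scales.
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[2.0, 400.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean/std from training data
X_test_scaled = scaler.transform(X_test)        # reuses the training mean/std

print(scaler.mean_)                 # per-feature training means
print(X_train_scaled.mean(axis=0))  # approximately 0 for each feature
print(X_test_scaled)                # test values expressed in training units
```

Because the test row is scaled with the training statistics, no information from the test set leaks into pre-processing.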

from sklearn.preprocessing import StandardScaler 
sc_X = StandardScaler() 
X_train = sc_X.fit_transform(X_train) 
X_test = sc_X.transform(X_test)

Model training

Let us look at how a model can be trained using the KNN algorithm.

from sklearn.datasets import load_iris
# the iris dataset is loaded
iris = load_iris()
# the feature matrix and response vector are stored
X = iris.data
y = iris.target
# X and y are split into training and testing datasets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)
# the model is trained on the training data
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
# predictions are made on the test data
y_pred = knn.predict(X_test)
# the actual and predicted responses are compared
from sklearn import metrics
print("kNN model accuracy:", metrics.accuracy_score(y_test, y_pred))
# predictions for sample data
sample = [[3, 5, 4, 2], [2, 3, 5, 4]]
preds = knn.predict(sample)
pred_species = [iris.target_names[p] for p in preds]
print("Predictions:", pred_species)
# the model is saved (sklearn.externals.joblib has been removed; joblib is imported directly)
import joblib
joblib.dump(knn, 'iris_knn.pkl')

Output:

kNN model accuracy: 0.9833333333333333 
Predictions: ['versicolor', 'virginica'] 
Out[13]: ['iris_knn.pkl']
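A saved model can later be read back with joblib.load. The sketch below is an illustration, not part of the original walkthrough: it trains a kNN model on the full iris dataset, writes it to a temporary directory (a choice made here so that no file is left behind), and checks that the restored model predicts the same labels.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# Save to a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "iris_knn.pkl")
    joblib.dump(knn, model_path)
    restored = joblib.load(model_path)

# The restored model should give the same predictions as the original.
same = bool((restored.predict(X) == knn.predict(X)).all())
print("Predictions match:", same)
```

Pickled models should only be loaded from trusted sources, and ideally with the same scikit-learn version that was used to save them.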

Advantages of using scikit-learn

It provides a consistent interface to implement learning algorithms.
It has good documentation, and a community of helpful users.
Its estimators expose many hyperparameters which can be tuned.
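The consistent interface mentioned above can be sketched by training several classifiers with the same fit and score calls. The choice of models, the iris dataset and the parameter values are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

# Three different algorithms, all used through the same methods.
models = {
    "kNN": KNeighborsClassifier(n_neighbors=3),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Decision tree": DecisionTreeClassifier(random_state=1),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                  # same training call
    scores[name] = model.score(X_test, y_test)   # same evaluation call
    print(name, scores[name])

# get_params() lists the hyperparameters an estimator exposes for tuning.
print(sorted(models["kNN"].get_params()))
```

Swapping one algorithm for another only changes the line that constructs the estimator; the rest of the code stays the same.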

Conclusion

In this post, we saw how scikit-learn can be used to implement machine learning algorithms with ease.
