
Machine Learning Tutorial

K-Nearest Neighbors (KNN)

Machine learning algorithms can be implemented from scratch (to understand how they work) or used through modules that are already available in libraries. 

In this post, we will understand what kNN, or k-nearest neighbours, is and how it can be implemented in Python. 

What is kNN? 

kNN is a supervised algorithm that is widely used in data mining, pattern recognition and many other fields. It can be used for classification as well as regression problems. It is also known as ‘lazy learning’, ‘instance-based learning’ and ‘non-parametric learning’. The reason kNN goes by these names will become clear as you read this post. 

For example: Consider we have two classes, class ‘A’ and class ‘B’. Class ‘A’ consists of chocolates and class ‘B’ consists of chips and other spicy items. Suppose one more item needs to be put into one of these classes (‘A’ or ‘B’); this will be done based on the characteristics of the item. If the item is spicy (assume someone has already tasted it), looks crispy and is orange/red in colour, it can be classified as belonging to the spicy class, class ‘B’. On the other hand, if the item is sweet (again, assume someone has tasted it) and looks soft, it can be classified as belonging to the sweet class, class ‘A’. 

Note: This is a hypothetical example, since not all eatables which are red belong to the spicy class and not all eatables which are soft belong to the sweet class. 

The kNN algorithm doesn’t really learn anything from the data it is given; instead, it tries to classify a newly given data point based on how similar it is to instances in the training dataset. Hence the name ‘lazy learning’. This also means predictions are made using the training dataset itself, hence the name ‘instance-based learning’. 

When a new instance of data is provided, the algorithm finds the ‘k’ (a fixed integer) instances from the training dataset that are most similar to it. It doesn’t assume anything about the distribution of the data while classifying new instances into one of the classes, hence the name ‘non-parametric algorithm’. 

Below is an image showing what kNN is used for: 

To determine which ‘k’ instances are the nearest, the distance between the new instance and each data point from the training set is computed. In general, for real-valued data of a similar type (classifying as sweet or spicy, comparing heights or weights), Euclidean distance is used. 

But based on the nature of our data, various other methods can be used, such as Hamming distance, Jaccard distance, Cosine distance or Manhattan distance. 
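As a quick sketch, the distance measures mentioned above can be written in plain Python (the function names here are my own, not part of any library):

```python
import math

def euclidean(a, b):
    # straight-line distance: square root of summed squared differences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # sum of absolute differences along each axis
    return sum(abs(x - y) for x, y in zip(a, b))

def hamming(a, b):
    # number of positions at which the values differ (useful for categorical data)
    return sum(x != y for x, y in zip(a, b))

print(euclidean([0, 0], [3, 4]))                     # 5.0
print(manhattan([0, 0], [3, 4]))                     # 7
print(hamming(["red", "sweet"], ["red", "spicy"]))   # 1
```

Euclidean and Manhattan distance suit numeric features, while Hamming distance counts mismatches and so fits categorical attributes like ‘sweet’ versus ‘spicy’.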

How to find which value of ‘k’ is right for my dataset? 

The answer is: it depends on the dataset. Usually, trial and error is used to see which value of ‘k’ gives the highest accuracy for the specific case; values ranging from 2 to 15 are commonly experimented with. 
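The trial-and-error search for ‘k’ can be sketched as a simple loop. This is a minimal, self-contained illustration on a made-up toy dataset (the helper `knn_predict` and the data are my own, not from the implementation later in this post):

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    # train: list of (features, label) pairs; vote among the k closest points
    nearest = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# toy data: two well-separated clusters (invented for illustration)
train = [([1, 1], 'A'), ([1, 2], 'A'), ([2, 1], 'A'),
         ([8, 8], 'B'), ([8, 9], 'B'), ([9, 8], 'B')]
test = [([1.5, 1.5], 'A'), ([8.5, 8.5], 'B')]

for k in range(1, 6):
    correct = sum(knn_predict(train, x, k) == y for x, y in test)
    print(k, correct / len(test))  # every k classifies this easy data correctly
```

On a real dataset, the accuracies would differ across ‘k’, and the value with the highest held-out accuracy would be chosen.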

We will now see the implementation of kNN from scratch, using the iris dataset. Download it from here.

It will be downloaded as a zip file which needs to be unzipped, and the path of the resulting CSV file has to be supplied to the code below. 

Implementation of kNN from scratch using Euclidean Distance 

Here’s the implementation: 

import csv
import random
import operator
import math

iris_data_path = "path to iris_data.csv"

def loadDataset(filename, split, trainingSet=[], testSet=[]):
    # read the CSV and randomly split rows into training and test sets
    with open(filename, 'r') as csvfile:
        lines = csv.reader(csvfile)
        dataset = list(lines)
        for x in range(len(dataset) - 1):
            for y in range(4):
                dataset[x][y] = float(dataset[x][y])
            if random.random() < split:
                trainingSet.append(dataset[x])
            else:
                testSet.append(dataset[x])

def euclideanDistance(instance1, instance2, length):
    # square root of summed squared differences over the first `length` features
    distance = 0
    for x in range(length):
        distance += pow((instance1[x] - instance2[x]), 2)
    return math.sqrt(distance)

def getNeighbors(trainingSet, testInstance, k):
    # return the k training instances closest to testInstance
    distances = []
    length = len(testInstance) - 1  # the last column is the class label
    for x in range(len(trainingSet)):
        dist = euclideanDistance(testInstance, trainingSet[x], length)
        distances.append((trainingSet[x], dist))
    distances.sort(key=operator.itemgetter(1))
    neighbors = []
    for x in range(k):
        neighbors.append(distances[x][0])
    return neighbors

def getResponse(neighbors):
    # majority vote over the neighbours' class labels
    classVotes = {}
    for x in range(len(neighbors)):
        response = neighbors[x][-1]
        if response in classVotes:
            classVotes[response] += 1
        else:
            classVotes[response] = 1
    sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse=True)
    return sortedVotes[0][0]

def getAccuracy(testSet, predictions):
    # percentage of test instances whose label was predicted correctly
    correct = 0
    for x in range(len(testSet)):
        if testSet[x][-1] == predictions[x]:
            correct += 1
    return (correct / float(len(testSet))) * 100.0

def main():
    # prepare data
    trainingSet = []
    testSet = []
    split = 0.67
    loadDataset(iris_data_path, split, trainingSet, testSet)
    print('Train set: ' + repr(len(trainingSet)))
    print('Test set: ' + repr(len(testSet)))
    # generate predictions
    predictions = []
    k = 3
    for x in range(len(testSet)):
        neighbors = getNeighbors(trainingSet, testSet[x], k)
        result = getResponse(neighbors)
        predictions.append(result)
        print('> predicted=' + repr(result) + ', actual=' + repr(testSet[x][-1]))
    accuracy = getAccuracy(testSet, predictions)
    print('Accuracy: ' + repr(accuracy) + '%')

main()
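For comparison, the same workflow can be reproduced with scikit-learn's KNeighborsClassifier. This is a sketch assuming scikit-learn is installed; it uses the bundled copy of the iris data rather than the downloaded CSV, and the exact accuracy depends on the random split:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# load the iris data that ships with scikit-learn
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)
print('Accuracy:', clf.score(X_test, y_test))
```

The library version handles the distance computation, sorting and voting internally, which is why it is only a few lines.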

Advantages of kNN Algorithm 

  1. Easy to understand and implement. 
  2. No need to build a model, tune parameters or make assumptions about the data. 
  3. Versatile: it can be used for both classification and regression. 

Disadvantages of kNN Algorithm 

  • It becomes slow as the dataset grows or the number of independent variables increases, since the distance to every training instance must be computed for each prediction.

Conclusion 

In today's post, we understood how the kNN algorithm works, how the value of 'k' is chosen, its implementation from scratch, and its advantages and disadvantages. 
