If you are thinking of a simple, easytoimplement supervised machine learning algorithm which can be used to solve both classification as well as regression problems, KNearest Neighbor (KNN) is the perfect choice. Learning KNN is a great way to introduce yourself to machine learning and classification in general. Also, you will find a lot of intense application of KNN in data mining, pattern recognition, semantic searching, intrusion detection and anomaly detection.
KNearest Neighbors is one of the most basic supervised machine learning algorithms, yet very essential. A supervised machine learning algorithm is one of the types of machine learning algorithm which is dependent on labelled input data in order to learn a function which is capable of producing an output whenever a new unlabeled data is given as input.
In real life scenarios, KNN is widely used as it is nonparametric which means it does not make any underlying assumptions about the distributions of data. With the business world entirely revolving around Data Science, it has become one of the most lucrative fields. Hence, the heavy demand for a Data Science Certification.
Let us look into how different is a parametric machine learning algorithm from a nonparametric machine learning algorithm.
Machine learning, in other words can be called as learning a function (f) which maps input variables (X) to the output variables (Y).
Y=f(X)
An algorithm learns about the target mapping function from the training data. As we are unaware of the form of the function, we have to evaluate various machine learning algorithms and figure out which algorithms perform better at providing an approximation of the underlying function.
Statistical Methods are classified on the basis of what we know about the population we are studying.
This particular algorithm involves two steps:
Let us consider a line to understand functional form for the mapping function as it is used in linear regression and simplify the learning process.
b_{0} + b_{1}*x_{1} + b_{2}*x_{2} = 0
Where b_{0}, b_{1} and b_{2} are the coefficients of the line which control the intercept and slope, and x_{1} and x_{2} are two input variables.
All we have to do now is to estimate the coefficients of the line equation to get a predictive model for the problem. Now, the problem is that the actual unknown underlying function may not be a linear function like a line. In that case, the approach will give poor results. Some of the examples of parametric machine learning algorithms are mentioned below:
Nonparametric methods always try to find the best fit training data while constructing the mapping function which also allows it to fit a large number of functional forms. Some of the examples of nonparametric machine learning algorithms are mentioned below:
The best example of nonparametric machine learning algorithms would be knearest neighbors algorithm which makes predictions based on the k most similar training patterns for a given set of new data instance. This method simply assumes that the patterns which are close are likely to be of similar type.
Parametric Machine Learning Algorithms  Nonparametric Machine Learning Algorithms  

Benefits 


Limitations 


There are several learning models namely:
Let us understand what Instance Based Learning is all about.
Under Instancebased Learning we have,
Nearestneighbor classifier
Uses k “closest” points (nearest neighbors) for performing classification. For example: It’s how people judge by observing our peers. We tend to move with people of similar attributes.
Lazy Learning  Eager Learning 

Simply stores the training data and waits until it is given a test tuple.  Munges the training data as soon as it receives it. 
It's slow as it calculates based on the current data set instead of coming up with an algorithm based on historical data.  It's fast as it has precalculated algorithm. 
Localized data so generalization takes time at every iteration.  On the basis of training set ,it constructs a classification model before receiving new data to classify. 
One of the biggest applications of KNearest Neighbor search is Recommender Systems. If you have noticed while you are shopping as a user on Amazon and you like a particular item, you are recommended with similar items.
It also recommends similar items bought by other users and other set of items which are often bought together. Basically, the algorithm compares the set of users who like each item and looks for similarity. This not only applies to recommending items or products but also recommending media and even advertisements to display to a user.
K nearest neighbors or KNN Algorithm is a simple algorithm which uses the entire dataset in its training phase. Whenever a prediction is required for an unseen data instance, it searches through the entire training dataset for kmost similar instances and the data with the most similar instance is finally returned as the prediction.
This algorithm suggests that if you’re similar to your neighbours, then you are one of them. Let us consider a simple example, if apple looks more similar to peach, pear, and cherry (fruits) than monkey, cat or a rat (animals), then most likely apple is a fruit.
Nearest Neighbours algorithm has been in action for the last sixty years. It is mainly used in statistical estimation and pattern recognition, as a nonparametric method, for regression and classification. The main aim of the KNearest Neighbor algorithm is to classify a new data point by comparing it to all previously seen data points. The classification of the k most similar previous cases are used for predicting the classification of the current data point. It is a simple algorithm which stores all available cases and classifies new cases based on a similarity measure (e.g., distance functions).
KNN algorithm can be used for applications which require high accuracy as it makes highly accurate predictions. The quality of predictions is completely dependent on the distance measure. Thus, this algorithm is suitable for applications for which you have sufficient domain knowledge so that it can help you select an appropriate measure.
As we have already seen KNN algorithm is a type of lazy learning, the computation for the generation is postponed until classification which indeed increases the costs of computation compared to other machine learning algorithms. But still KNN is considered to be the better choice for applications where accuracy is more important and predictions are not requested frequently.
KNN can be used for both regression and classification predictive problems. However, in the industry it is mostly used in classification problems.
Generally we mainly look at 3 important aspects in order to evaluate any technique:
Let us consider a few examples to place KNN in the scale :
If you notice the chart mentioned above, KNN algorithm exceeds in most of the parameters. It is most commonly used for ease of interpretation and low calculation time.
KNN algorithm works on the basis of feature similarity. The classification of a given data point is determined by how closely outofsample features resemble our training set.
The above figure shows an example of kNN classification. If you consider the nearest neighbor to the test sample, it is a blue square (Class 1) and k=1. This falls inside the inner circle.
Now, if you consider k=3, then you will see 2 red triangles and only 1 blue square falls under the outer circle. Thus, the test sample is classified as a red triangle now (Class 2).
Similarly, if you consider k=5, it is assigned to the first class (3 squares vs. 2 triangles outside the outer circle).
In regression problems, KNN is used for prediction based on the mean or the median of the Kmost similar instances.
Knearestneighbor classification was actually developed from the need to perform discriminant analysis when reliable parametric estimates of probability densities are unknown or are difficult to determine. When KNN is used for classification, the output is easily calculated by the class having the highest frequency from the Kmost similar instances. The class with maximum vote is taken into consideration for prediction.
The probabilities of Classes can be calculated as the normalized frequency of samples that belong to each class in the set of K most similar instances for a new data instance.
For example, in a binary classification problem (class is 0 or 1):
p(class=0) = count(class=0) / (count(class=0)+count(class=1))
If you are using K and you have an even number of classes (e.g. 2) it is a good idea to choose a K value with an odd number to avoid a tie. And the inverse, use an even number for K when you have an odd number of classes.
Ties can be broken consistently by expanding K by 1 and looking at the class of the next most similar instance in the training dataset.
A case can be classified by a majority vote of its neighbors. The case is then assigned to the most common class amongst its K nearest neighbors measured by a distance function. Suppose the value of K is 1, then the case is simply assigned to the class of its nearest neighbor.
The three distance measures mentioned above are valid only for continuous variables. For categorical variables, the Hamming distance is used. It also brings up the issue of standardization of the numerical variables between 0 and 1 when there is a mixture of numerical and categorical variables in the dataset.
By inspecting the data, you can choose the best optimal value for K. Generally, a large value of K is more accurate as it tends to reduce the overall noise but is not always true. Another way to retrospectively determine a good K value by using an independent dataset to validate the K value is Crossvalidation. According to observation, the optimal K for most datasets has been between 310 which provides better results than 1NN.
For example, let us consider an example where the data mentioned below us concerned with credit default. Age and Loan are two numerical variables (predictors) and Default is the target.
By observing the data mentioned above, we can use the training set in order to classify an unknown case (Age=48 and Loan=$142,000) using Euclidean distance. If K=1 then the nearest neighbor is the last case in the training set with Default=Y.
Age  Loan  Default  Distance  

25  $40,000  N  102000  
35  $60,000  N  82000  
45  $80,000  N  62000  
20  $20,000  N  122000  
35  $120,000  N  22000  2 
52  $18,000  N  124000  
23  $95,000  Y  47000  
40  $62,000  Y  80000  
60  $100,000  Y  42000  3 
48  $220,000  Y  78000  
33  $150,000  Y  8000  1 
48  $142,000  ? 
With K=3, there are two Default=Y and one Default=N out of three closest neighbors. The prediction for the unknown case is again Default=Y.
Standardized Distance
One major drawback in calculating distance measures directly from the training set is in the case where variables have different measurement scales or there is a mixture of numerical and categorical variables. For example, if one variable is based on annual income in dollars, and the other is based on age in years then income will have a much higher influence on the distance calculated. One solution is to standardize the training set as shown below.
Age  Loan  Default  Distance 

0.125  0.11  N  0.7652 
0.375  0.21  N  0.5200 
0.625  0.31  N  0.3160 
0  0.01  N  0.9245 
0.375  0.50  N  0.3428 
0.8  0.00  N  0.6220 
0.075  0.38  Y  0.6669 
0.5  0.22  Y  0.4437 
1  0.41  Y  0.3650 
0.7  1.00  Y  0.3861 
0.325  0.65  Y  0.3771 
0.7  0.61  ? 
Using the standardized distance on the same training set, the unknown case returned a different neighbor which is not a good sign of robustness.
Betweensample geometric distance
The knearestneighbor classifier is commonly based on the Euclidean distance between a test sample and the specified training samples. Let xi be an input sample with p features, (xi1, xi2, …, xip), n be the total number of input samples (i=1,2,…,n) and p the total number of features (j=1,2,…,p) . The Euclidean distance between sample xi and xl (l=1,2,…,n) is defined as:
A graphical representation of the nearest neighbor concept is illustrated in the Voronoi tessellation. The tessellation shows 19 samples marked with a "+", and the Voronoi cell, R, surrounding each sample. A Voronoi cell encapsulates all neighboring points that are nearest to each sample and is defined as:
Where Ri is the Voronoi cell for sample xi, and x represents all possible points within Voronoi cell Ri.
The Voronoi tessellation reflects two characteristics of the example 2dimensional coordinate system: i) all possible points within a sample's Voronoi cell are the nearest neighboring points for that sample, and ii) for any sample, the nearest sample is determined by the closest Voronoi cell edge.
According to the latter characteristic, the knearestneighbor classification rule is to assign to a test sample the majority category label of its k nearest training samples. In practice, k is usually chosen to be odd, so as to avoid ties. The k = 1 rule is generally called the nearestneighbor classification rule.
The curse of dimensionality refers to various phenomena that are witnessed while analyzing and organizing data in highdimensional spaces (often with hundreds or thousands of dimensions). Such phenomenon do not occur in lowdimensional settings such as the threedimensional physical space of everyday experience.
KNN algorithm will work absolutely fine when you are dealing with a small number of input variables (p) but will struggle when there are a large number of inputs.
KNN works well with a small number of input variables (p), but struggles when the number of inputs is very large. Each input variable can be considered a dimension of a pdimensional input space. For example, suppose you have two input variables x1 and x2, the input space would be 2dimensional. With the increase in the number of dimensions, the volume of the input space increases at an exponential rate.
In case of higher dimensions, the points which are similar may have large distances. All these points will be then away from each other and our intuition about 2 to 3 dimensional spaces will not be applicable. This kind of problem is called the “Curse of Dimensionality“.
KMeans Clustering and kNearest Neighbors algorithm, both are commonly used algorithms in Machine Learning. They are often confused with each other, especially when we are talking about the kfactor. The ‘K’ in KMeans Clustering has nothing to do with the ‘K’ in KNN algorithm. kMeans Clustering is an unsupervised learning algorithm that is used for clustering whereas KNN is a supervised learning algorithm used for classification.
The kmeans algorithm is an unsupervised clustering algorithm which takes a couple of unlabeled points and then groups them into “k” number of clusters.
The “k” in kmeans denotes the number of clusters you would like to have in the end. Suppose the value of k is 5, it means you will have 5 clusters on the data set.
Let us see how it works.
Step 1: First you determine the value of K by Elbow method and then specify the number of clusters K
Step 2: Next you have to randomly assign each data point to a cluster
Step 3: Determine the cluster centroid coordinates
Step 4: Determine the distances of each data point to the centroids and reassign each point to the closest cluster centroid based upon minimum distance
Step 5: Calculate cluster centroids again
Step 6: Repeat steps 4 and 5 until we reach global optima where further improvements are not possible and there is no provision to switch data points from one cluster to another.
#Finding the optimum number of clusters for kmeans clustering Nc = range(1, 10) kmeans = [KMeans(n_clusters=i) for i in Nc] kmeans score = [kmeans[i].fit(x).score(x) for i in range(len(kmeans))] score pl.plot(Nc,score) pl.xlabel('Number of Clusters') pl.ylabel('Score') pl.title('Elbow Curve') pl.show()
You can clearly see why it is called 'The elbow method' from the above graph, the optimum clusters is where the elbow occurs.
Now that we have the optimum amount of clusters (k=3), we can move on to applying Kmeans clustering to the Iris dataset.
#Implementation of KMeans Clustering model = KMeans(n_clusters = 3) model.fit(x) model.labels_ colormap = np.array(['Red', 'Blue', 'Green']) z = plt.scatter(x.sepal_length, x.sepal_width, x.petal_length, c = colormap[model.labels_])
#Accuracy of KMeans Clustering
accuracy_score(iris.target,model.labels_)
0.8933333333333333
By now, we already know that KNN algorithm is a supervised classification algorithm. It takes into consideration a couple of labelled points and then uses those points to learn how to label other points. To be able to assign label to other points, KNN algorithm looks for the closest neighbor of the new point and checks for voting. The most number of neighbors around the new point decide the label of the new point.
The “k” in KNearest Neighbors is the number of neighbors it checks. It is supervised because it is trying to classify a point on the basis of the known classification of other points.
Let us see how it works.
Step 1: Firstly, you determine the value for K.
Step 2: Then you calculate the distances between the new input (test data) and all the training data. The most commonly used metrics for calculating distance are Euclidean, Manhattan and Minkowski
Step 3: Sort the distance and determine k nearest neighbors based on minimum distance values
Step 4: Analyze the category of those neighbors and assign the category for the test data based on majority vote
Step 5: Return the predicted class
error = [] # Calculating error for K values between 1 and 40 for i in range(1, 40): KNN = KNeighborsClassifier(n_neighbors=i) KNN.fit(X_train, y_train) pred_i = KNN.predict(X_test) error.append(np.mean(pred_i != y_test)) plt.figure(figsize=(12, 6)) plt.plot(range(1, 40), error, color='black', linestyle='dashed', marker='o', markerfacecolor='grey', markersize=10) plt.title('Error Rate K Value') plt.xlabel('K Value') plt.ylabel('Mean Error') Text(0, 0.5, 'Mean Error')
Now we know for what values of ‘K’, the error rate will be less. Let’s fix k=5 and implement KNN algorithm.
#Creating training and test splits from sklearn.model_selection import train_test_split X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20) #Performing Feature Scaling from sklearn.preprocessing import StandardScaler scaler = StandardScaler() scaler.fit(X_train) X_train = scaler.transform(X_train) X_test = scaler.transform(X_test) #Training KNN with k=5 from sklearn.neighbors import KNeighborsClassifier classifier = KNeighborsClassifier(n_neighbors=5) classifier.fit(X_train, y_train) KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_jobs=None, n_neighbors=5, p=2, weights='uniform') y_pred = classifier.predict(X_test) from sklearn.metrics import classification_report, confusion_matrix print(confusion_matrix(y_test, y_pred)) print(classification_report(y_test, y_pred)) [[10 0 0] [ 0 9 2] [ 0 1 8]] precision recall f1score support Irissetosa 1.00 1.00 1.00 10 Irisversicolor 0.90 0.82 0.86 11 Irisvirginica 0.80 0.89 0.84 9 accuracy 0.90 30 macro avg 0.90 0.90 0.90 30 weighted avg 0.90 0.90 0.90 30
Now that we have we have seen how KNN works, let us look into some of the practical applications of KNN.
In this article we have learned about the KNearest Neighbor algorithm, where we should use it, how it works and so on. Also, we have discussed about parametric and nonparametric machine learning algorithms, instance based learning, eager and lazy learning, advantages and disadvantages of using KNN, performance improvement suggestions and have implemented KNN in Python. To learn more about other machine learning algorithms, join our Data Science Certification course and expand your learning skill set and career opportunities.
Data has become the new game changer for busines... Read More
Statistics is a science concerned with collection,... Read More
What is Dispersion in StatisticsDispersion in stat... Read More
Join the Discussion
Your email address will not be published. Required fields are marked *