If you are thinking of a simple, easy-to-implement supervised machine learning algorithm which can be used to solve both classification as well as regression problems, K-Nearest Neighbor (K-NN) is the perfect choice. Learning K-NN is a great way to introduce yourself to machine learning and classification in general. Also, you will find a lot of intense application of K-NN in data mining, pattern recognition, semantic searching, intrusion detection and anomaly detection.
K-Nearest Neighbors is one of the most basic supervised machine learning algorithms, yet very essential. A supervised machine learning algorithm is one of the types of machine learning algorithm which is dependent on labelled input data in order to learn a function which is capable of producing an output whenever a new unlabeled data is given as input.
In real life scenarios, K-NN is widely used as it is non-parametric which means it does not make any underlying assumptions about the distributions of data. With the business world entirely revolving around Data Science, it has become one of the most lucrative fields. Hence, the heavy demand for a Data Science Certification.
Let us look into how different is a parametric machine learning algorithm from a nonparametric machine learning algorithm.
Machine learning, in other words can be called as learning a function (f) which maps input variables (X) to the output variables (Y).
An algorithm learns about the target mapping function from the training data. As we are unaware of the form of the function, we have to evaluate various machine learning algorithms and figure out which algorithms perform better at providing an approximation of the underlying function.
Statistical Methods are classified on the basis of what we know about the population we are studying.
This particular algorithm involves two steps:
Let us consider a line to understand functional form for the mapping function as it is used in linear regression and simplify the learning process.
b0 + b1*x1 + b2*x2 = 0
Where b0, b1 and b2 are the coefficients of the line which control the intercept and slope, and x1 and x2 are two input variables.
All we have to do now is to estimate the coefficients of the line equation to get a predictive model for the problem. Now, the problem is that the actual unknown underlying function may not be a linear function like a line. In that case, the approach will give poor results. Some of the examples of parametric machine learning algorithms are mentioned below:
Nonparametric methods always try to find the best fit training data while constructing the mapping function which also allows it to fit a large number of functional forms. Some of the examples of nonparametric machine learning algorithms are mentioned below:
The best example of nonparametric machine learning algorithms would be k-nearest neighbors algorithm which makes predictions based on the k most similar training patterns for a given set of new data instance. This method simply assumes that the patterns which are close are likely to be of similar type.
|Parametric Machine Learning Algorithms||Nonparametric Machine Learning Algorithms|
There are several learning models namely:
Let us understand what Instance Based Learning is all about.
Under Instance-based Learning we have,
Uses k “closest” points (nearest neighbors) for performing classification. For example: It’s how people judge by observing our peers. We tend to move with people of similar attributes.
|Lazy Learning||Eager Learning|
|Simply stores the training data and waits until it is given a test tuple.||Munges the training data as soon as it receives it.|
|It's slow as it calculates based on the current data set instead of coming up with an algorithm based on historical data.||It's fast as it has pre-calculated algorithm.|
|Localized data so generalization takes time at every iteration.||On the basis of training set ,it constructs a classification model before receiving new data to classify.|
One of the biggest applications of K-Nearest Neighbor search is Recommender Systems. If you have noticed while you are shopping as a user on Amazon and you like a particular item, you are recommended with similar items.
It also recommends similar items bought by other users and other set of items which are often bought together. Basically, the algorithm compares the set of users who like each item and looks for similarity. This not only applies to recommending items or products but also recommending media and even advertisements to display to a user.
K nearest neighbors or K-NN Algorithm is a simple algorithm which uses the entire dataset in its training phase. Whenever a prediction is required for an unseen data instance, it searches through the entire training dataset for k-most similar instances and the data with the most similar instance is finally returned as the prediction.
This algorithm suggests that if you’re similar to your neighbours, then you are one of them. Let us consider a simple example, if apple looks more similar to peach, pear, and cherry (fruits) than monkey, cat or a rat (animals), then most likely apple is a fruit.
Nearest Neighbours algorithm has been in action for the last sixty years. It is mainly used in statistical estimation and pattern recognition, as a non-parametric method, for regression and classification. The main aim of the K-Nearest Neighbor algorithm is to classify a new data point by comparing it to all previously seen data points. The classification of the k most similar previous cases are used for predicting the classification of the current data point. It is a simple algorithm which stores all available cases and classifies new cases based on a similarity measure (e.g., distance functions).
K-NN algorithm can be used for applications which require high accuracy as it makes highly accurate predictions. The quality of predictions is completely dependent on the distance measure. Thus, this algorithm is suitable for applications for which you have sufficient domain knowledge so that it can help you select an appropriate measure.
As we have already seen K-NN algorithm is a type of lazy learning, the computation for the generation is postponed until classification which indeed increases the costs of computation compared to other machine learning algorithms. But still K-NN is considered to be the better choice for applications where accuracy is more important and predictions are not requested frequently.
K-NN can be used for both regression and classification predictive problems. However, in the industry it is mostly used in classification problems.
Generally we mainly look at 3 important aspects in order to evaluate any technique:
Let us consider a few examples to place K-NN in the scale :
If you notice the chart mentioned above, K-NN algorithm exceeds in most of the parameters. It is most commonly used for ease of interpretation and low calculation time.
K-NN algorithm works on the basis of feature similarity. The classification of a given data point is determined by how closely out-of-sample features resemble our training set.
The above figure shows an example of k-NN classification. If you consider the nearest neighbor to the test sample, it is a blue square (Class 1) and k=1. This falls inside the inner circle.
Now, if you consider k=3, then you will see 2 red triangles and only 1 blue square falls under the outer circle. Thus, the test sample is classified as a red triangle now (Class 2).
Similarly, if you consider k=5, it is assigned to the first class (3 squares vs. 2 triangles outside the outer circle).
In regression problems, K-NN is used for prediction based on the mean or the median of the K-most similar instances.
K-nearest-neighbor classification was actually developed from the need to perform discriminant analysis when reliable parametric estimates of probability densities are unknown or are difficult to determine. When K-NN is used for classification, the output is easily calculated by the class having the highest frequency from the K-most similar instances. The class with maximum vote is taken into consideration for prediction.
The probabilities of Classes can be calculated as the normalized frequency of samples that belong to each class in the set of K most similar instances for a new data instance.
For example, in a binary classification problem (class is 0 or 1):
p(class=0) = count(class=0) / (count(class=0)+count(class=1))
If you are using K and you have an even number of classes (e.g. 2) it is a good idea to choose a K value with an odd number to avoid a tie. And the inverse, use an even number for K when you have an odd number of classes.
Ties can be broken consistently by expanding K by 1 and looking at the class of the next most similar instance in the training dataset.
A case can be classified by a majority vote of its neighbors. The case is then assigned to the most common class amongst its K nearest neighbors measured by a distance function. Suppose the value of K is 1, then the case is simply assigned to the class of its nearest neighbor.
The three distance measures mentioned above are valid only for continuous variables. For categorical variables, the Hamming distance is used. It also brings up the issue of standardization of the numerical variables between 0 and 1 when there is a mixture of numerical and categorical variables in the dataset.
By inspecting the data, you can choose the best optimal value for K. Generally, a large value of K is more accurate as it tends to reduce the overall noise but is not always true. Another way to retrospectively determine a good K value by using an independent dataset to validate the K value is Cross-validation. According to observation, the optimal K for most datasets has been between 3-10 which provides better results than 1NN.
For example, let us consider an example where the data mentioned below us concerned with credit default. Age and Loan are two numerical variables (predictors) and Default is the target.
By observing the data mentioned above, we can use the training set in order to classify an unknown case (Age=48 and Loan=$142,000) using Euclidean distance. If K=1 then the nearest neighbor is the last case in the training set with Default=Y.
With K=3, there are two Default=Y and one Default=N out of three closest neighbors. The prediction for the unknown case is again Default=Y.
One major drawback in calculating distance measures directly from the training set is in the case where variables have different measurement scales or there is a mixture of numerical and categorical variables. For example, if one variable is based on annual income in dollars, and the other is based on age in years then income will have a much higher influence on the distance calculated. One solution is to standardize the training set as shown below.
Using the standardized distance on the same training set, the unknown case returned a different neighbor which is not a good sign of robustness.
Between-sample geometric distance
The k-nearest-neighbor classifier is commonly based on the Euclidean distance between a test sample and the specified training samples. Let xi be an input sample with p features, (xi1, xi2, …, xip), n be the total number of input samples (i=1,2,…,n) and p the total number of features (j=1,2,…,p) . The Euclidean distance between sample xi and xl (l=1,2,…,n) is defined as:
A graphical representation of the nearest neighbor concept is illustrated in the Voronoi tessellation. The tessellation shows 19 samples marked with a "+", and the Voronoi cell, R, surrounding each sample. A Voronoi cell encapsulates all neighboring points that are nearest to each sample and is defined as:
Where Ri is the Voronoi cell for sample xi, and x represents all possible points within Voronoi cell Ri.
The Voronoi tessellation reflects two characteristics of the example 2-dimensional coordinate system: i) all possible points within a sample's Voronoi cell are the nearest neighboring points for that sample, and ii) for any sample, the nearest sample is determined by the closest Voronoi cell edge.
According to the latter characteristic, the k-nearest-neighbor classification rule is to assign to a test sample the majority category label of its k nearest training samples. In practice, k is usually chosen to be odd, so as to avoid ties. The k = 1 rule is generally called the nearest-neighbor classification rule.
The curse of dimensionality refers to various phenomena that are witnessed while analyzing and organizing data in high-dimensional spaces (often with hundreds or thousands of dimensions). Such phenomenon do not occur in low-dimensional settings such as the three-dimensional physical space of everyday experience.
K-NN algorithm will work absolutely fine when you are dealing with a small number of input variables (p) but will struggle when there are a large number of inputs.
K-NN works well with a small number of input variables (p), but struggles when the number of inputs is very large. Each input variable can be considered a dimension of a p-dimensional input space. For example, suppose you have two input variables x1 and x2, the input space would be 2-dimensional. With the increase in the number of dimensions, the volume of the input space increases at an exponential rate.
In case of higher dimensions, the points which are similar may have large distances. All these points will be then away from each other and our intuition about 2 to 3 dimensional spaces will not be applicable. This kind of problem is called the “Curse of Dimensionality“.
K-Means Clustering and k-Nearest Neighbors algorithm, both are commonly used algorithms in Machine Learning. They are often confused with each other, especially when we are talking about the k-factor. The ‘K’ in K-Means Clustering has nothing to do with the ‘K’ in K-NN algorithm. k-Means Clustering is an unsupervised learning algorithm that is used for clustering whereas K-NN is a supervised learning algorithm used for classification.
The k-means algorithm is an unsupervised clustering algorithm which takes a couple of unlabeled points and then groups them into “k” number of clusters.
The “k” in k-means denotes the number of clusters you would like to have in the end. Suppose the value of k is 5, it means you will have 5 clusters on the data set.
Let us see how it works.
Step 1: First you determine the value of K by Elbow method and then specify the number of clusters K
Step 2: Next you have to randomly assign each data point to a cluster
Step 3: Determine the cluster centroid coordinates
Step 4: Determine the distances of each data point to the centroids and re-assign each point to the closest cluster centroid based upon minimum distance
Step 5: Calculate cluster centroids again
Step 6: Repeat steps 4 and 5 until we reach global optima where further improvements are not possible and there is no provision to switch data points from one cluster to another.
#Finding the optimum number of clusters for k-means clustering Nc = range(1, 10) kmeans = [KMeans(n_clusters=i) for i in Nc] kmeans score = [kmeans[i].fit(x).score(x) for i in range(len(kmeans))] score pl.plot(Nc,score) pl.xlabel('Number of Clusters') pl.ylabel('Score') pl.title('Elbow Curve') pl.show()
You can clearly see why it is called 'The elbow method' from the above graph, the optimum clusters is where the elbow occurs.
Now that we have the optimum amount of clusters (k=3), we can move on to applying K-means clustering to the Iris dataset.
#Implementation of K-Means Clustering model = KMeans(n_clusters = 3) model.fit(x) model.labels_ colormap = np.array(['Red', 'Blue', 'Green']) z = plt.scatter(x.sepal_length, x.sepal_width, x.petal_length, c = colormap[model.labels_])
#Accuracy of K-Means Clustering accuracy_score(iris.target,model.labels_) 0.8933333333333333
By now, we already know that K-NN algorithm is a supervised classification algorithm. It takes into consideration a couple of labelled points and then uses those points to learn how to label other points. To be able to assign label to other points, K-NN algorithm looks for the closest neighbor of the new point and checks for voting. The most number of neighbors around the new point decide the label of the new point.
The “k” in K-Nearest Neighbors is the number of neighbors it checks. It is supervised because it is trying to classify a point on the basis of the known classification of other points.
Let us see how it works.
Step 1: Firstly, you determine the value for K.
Step 2: Then you calculate the distances between the new input (test data) and all the training data. The most commonly used metrics for calculating distance are Euclidean, Manhattan and Minkowski
Step 3: Sort the distance and determine k nearest neighbors based on minimum distance values
Step 4: Analyze the category of those neighbors and assign the category for the test data based on majority vote
Step 5: Return the predicted class
error =  # Calculating error for K values between 1 and 40 for i in range(1, 40): K-NN = KNeighborsClassifier(n_neighbors=i) K-NN.fit(X_train, y_train) pred_i = K-NN.predict(X_test) error.append(np.mean(pred_i != y_test)) plt.figure(figsize=(12, 6)) plt.plot(range(1, 40), error, color='black', linestyle='dashed', marker='o', markerfacecolor='grey', markersize=10) plt.title('Error Rate K Value') plt.xlabel('K Value') plt.ylabel('Mean Error') Text(0, 0.5, 'Mean Error')
Now we know for what values of ‘K’, the error rate will be less. Let’s fix k=5 and implement K-NN algorithm.
#Creating training and test splits from sklearn.model_selection import train_test_split X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20) #Performing Feature Scaling from sklearn.preprocessing import StandardScaler scaler = StandardScaler() scaler.fit(X_train) X_train = scaler.transform(X_train) X_test = scaler.transform(X_test) #Training K-NN with k=5 from sklearn.neighbors import KNeighborsClassifier classifier = KNeighborsClassifier(n_neighbors=5) classifier.fit(X_train, y_train) KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_jobs=None, n_neighbors=5, p=2, weights='uniform') y_pred = classifier.predict(X_test) from sklearn.metrics import classification_report, confusion_matrix print(confusion_matrix(y_test, y_pred)) print(classification_report(y_test, y_pred)) [[10 0 0] [ 0 9 2] [ 0 1 8]] precision recall f1-score support Iris-setosa 1.00 1.00 1.00 10 Iris-versicolor 0.90 0.82 0.86 11 Iris-virginica 0.80 0.89 0.84 9 accuracy 0.90 30 macro avg 0.90 0.90 0.90 30 weighted avg 0.90 0.90 0.90 30
Now that we have we have seen how K-NN works, let us look into some of the practical applications of K-NN.
In this article we have learned about the K-Nearest Neighbor algorithm, where we should use it, how it works and so on. Also, we have discussed about parametric and nonparametric machine learning algorithms, instance based learning, eager and lazy learning, advantages and disadvantages of using K-NN, performance improvement suggestions and have implemented K-NN in Python. To learn more about other machine learning algorithms, join our Data Science Certification course and expand your learning skill set and career opportunities.
Your email address will not be published. Required fields are marked *