What is K-Nearest Neighbor in Machine Learning: K-NN Algorithm

If you are looking for a simple, easy-to-implement supervised machine learning algorithm that can solve both classification and regression problems, K-Nearest Neighbor (K-NN) is a perfect choice. Learning K-NN is a great way to introduce yourself to machine learning and classification in general. You will also find widespread applications of K-NN in data mining, pattern recognition, semantic searching, intrusion detection and anomaly detection.

K-Nearest Neighbors is one of the most basic, yet essential, supervised machine learning algorithms. A supervised machine learning algorithm is one that depends on labelled input data to learn a function capable of producing an output whenever new unlabeled data is given as input.

In real-life scenarios, K-NN is widely used because it is non-parametric, meaning it makes no underlying assumptions about the distribution of the data. With the business world increasingly revolving around Data Science, it has become one of the most lucrative fields, hence the heavy demand for a Data Science Certification.

Parametric vs Non-parametric Methods

Let us look into how a parametric machine learning algorithm differs from a nonparametric one.

Machine learning, in other words, can be described as learning a function (f) that maps input variables (X) to output variables (Y).

Y=f(X)

An algorithm learns the target mapping function from the training data. Since we do not know the form of this function, we have to evaluate various machine learning algorithms and figure out which performs better at approximating the underlying function.

Statistical Methods are classified on the basis of what we know about the population we are studying.

  • Parametric statistics is a branch of statistics which assumes that sample data comes from a population that follows a probability distribution based on a fixed set of parameters.
  • Nonparametric statistics is the branch of statistics that is not based solely on population parameters.

Parametric Machine Learning Algorithms

Parametric machine learning algorithms involve two steps:

  1. Selecting a form for the function
  2. Learning the coefficients for the function from the training data

To simplify the learning process, let us consider a line as the functional form for the mapping function, as is used in linear regression.

b0 + b1*x1 + b2*x2 = 0

Where b0, b1 and b2 are the coefficients of the line which control the intercept and slope, and x1 and x2 are two input variables.

All we have to do now is estimate the coefficients of the line equation to get a predictive model for the problem. The catch is that the actual unknown underlying function may not be linear; in that case, this approach will give poor results. Some examples of parametric machine learning algorithms are mentioned below (a small sketch after the list shows coefficients being estimated for one of them):

  • Logistic Regression
  • Linear Discriminant Analysis
  • Perceptron
  • Naive Bayes
  • Simple Neural Networks
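
To make this concrete, here is a minimal sketch, on hypothetical toy data, of how a parametric method such as logistic regression estimates the coefficients b0, b1 and b2 of the separating line; the data and random seed are illustrative assumptions:

from sklearn.linear_model import LogisticRegression
import numpy as np

# Hypothetical toy data: two Gaussian clouds, one per class
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

model = LogisticRegression().fit(X, y)
print(model.intercept_)   # b0
print(model.coef_)        # b1, b2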

Nonparametric Machine Learning Algorithms

Nonparametric methods seek the best fit to the training data while constructing the mapping function, which allows them to fit a large number of functional forms. Some examples of nonparametric machine learning algorithms are mentioned below:

  • k-Nearest Neighbors
  • Decision Trees like CART and C4.5
  • Support Vector Machines

The best example of a nonparametric machine learning algorithm would be k-nearest neighbors, which makes predictions based on the k most similar training patterns for a new data instance. This method simply assumes that patterns which are close together are likely to be of a similar type.


Benefits of parametric machine learning algorithms:
  • Simple to understand and interpret results
  • Fast to learn from data
  • Require less training data

Benefits of nonparametric machine learning algorithms:
  • Flexible enough to fit a large number of functional forms
  • Make no assumptions about the underlying function
  • Can provide high predictive performance

Limitations of parametric machine learning algorithms:
  • Choosing a functional form constrains the method to that form
  • Limited complexity, more suited to simpler problems
  • Unlikely to match the underlying mapping function, resulting in a poor fit

Limitations of nonparametric machine learning algorithms:
  • Require more training data to estimate the mapping function
  • Comparatively slower, as there are more parameters to train
  • Carry a risk of overfitting the training data

Method Based Learning

There are several learning models namely:

  • Association rules based
  • Ensemble method based
  • Deep Learning based
  • Clustering method based
  • Regression Analysis based
  • Bayesian method based
  • Dimensionality reduction based
  • Kernel method based
  • Instance based

Let us understand what Instance Based Learning is all about.

Instance Based Learning (IBL)

  • Instance-based methods are the simplest form of learning
  • Instance-based learning is lazy learning
  • The K-NN model works on identified instances
  • Instances are retrieved from memory and then used to classify the new query instance
  • Instance-based learning is also called memory-based or case-based learning

Under Instance-based Learning we have,

Nearest-neighbor classifier

Uses the k “closest” points (nearest neighbors) to perform classification. For example, it mirrors how people judge by observing peers: we tend to move with people of similar attributes.

Lazy Learning vs Eager Learning

Lazy Learning:
  • Simply stores the training data and waits until it is given a test tuple.
  • Slower at prediction time, as it computes from the stored data set rather than from a pre-built model.
  • Data stays localized, so generalization is repeated at every query.

Eager Learning:
  • Processes the training data as soon as it receives it.
  • Faster at prediction time, as the model is pre-computed.
  • Constructs a classification model from the training set before receiving new data to classify.

What is K-NN?

One of the biggest applications of K-Nearest Neighbor search is Recommender Systems. If you have noticed while you are shopping as a user on Amazon and you like a particular item, you are recommended with similar items.


It also recommends similar items bought by other users and other set of items which are often bought together. Basically, the algorithm compares the set of users who like each item and looks for similarity. This not only applies to recommending items or products but also recommending media and even advertisements to display to a user.

K nearest neighbors, or the K-NN algorithm, is a simple algorithm which uses the entire dataset as its training phase. Whenever a prediction is required for an unseen data instance, it searches the entire training dataset for the k most similar instances, and the data from the most similar instances is returned as the prediction.

This algorithm suggests that if you’re similar to your neighbours, then you are one of them. As a simple example: if an apple looks more similar to a peach, pear or cherry (fruits) than to a monkey, cat or rat (animals), then most likely the apple is a fruit.

The Nearest Neighbours algorithm has been in use for the last sixty years, mainly in statistical estimation and pattern recognition, as a non-parametric method for regression and classification. The main aim of the K-Nearest Neighbor algorithm is to classify a new data point by comparing it to all previously seen data points; the classifications of the k most similar previous cases are used to predict the classification of the current data point. It is a simple algorithm which stores all available cases and classifies new cases based on a similarity measure (e.g., distance functions).

When do we use K-NN algorithm?

K-NN can make highly accurate predictions, but the quality of those predictions is completely dependent on the distance measure. The algorithm is therefore suitable for applications where you have sufficient domain knowledge to help you select an appropriate measure.

As we have already seen, K-NN is a type of lazy learning: computation is postponed until classification, which increases the cost of computation compared to other machine learning algorithms. Still, K-NN is considered the better choice for applications where accuracy is more important and predictions are not requested frequently.

K-NN can be used for both regression and classification predictive problems. However, in the industry it is mostly used in classification problems.

Generally, we look at three important aspects in order to evaluate any technique:

  1. Ease to interpret output
  2. Calculation time
  3. Predictive Power

Let us consider a few examples to place K-NN on this scale:

[Figure: K-NN placed against other techniques on ease of interpretation, calculation time and predictive power]

If you look at the chart above, the K-NN algorithm scores well on most of these parameters. It is most commonly chosen for its ease of interpretation and low calculation time.

How does the K-NN algorithm work?

The K-NN algorithm works on the basis of feature similarity: the classification of a given data point is determined by how closely its out-of-sample features resemble our training set.

[Figure: Example of k-NN classification with k=1, k=3 and k=5]

The above figure shows an example of k-NN classification. With k=1, the nearest neighbor to the test sample is a blue square (Class 1); it falls inside the inner circle.

With k=3, the inner circle contains 2 red triangles and only 1 blue square, so the test sample is now classified as a red triangle (Class 2).

Similarly, with k=5, the sample is assigned to the first class again (3 squares vs. 2 triangles inside the outer circle).

K-NN in Regression

In regression problems, K-NN is used for prediction based on the mean or the median of the K-most similar instances.
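
As a minimal sketch with made-up 1-D data, K-NN regression simply averages the targets of the k nearest training instances:

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical 1-D training data
X_train = np.array([[1], [2], [3], [6], [7], [8]])
y_train = np.array([1.1, 1.9, 3.2, 6.1, 6.9, 8.2])

reg = KNeighborsRegressor(n_neighbors=3)   # predicts the mean of the 3 nearest targets
reg.fit(X_train, y_train)
print(reg.predict([[2.5]]))   # mean of the targets for x = 1, 2 and 3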

K-NN in Classification

K-nearest-neighbor classification was actually developed from the need to perform discriminant analysis when reliable parametric estimates of probability densities are unknown or difficult to determine. When K-NN is used for classification, the output is simply the class with the highest frequency among the K most similar instances; the class with the maximum vote is taken as the prediction.

The probability of each class can be calculated as the normalized frequency of samples belonging to that class among the K most similar instances of a new data instance.

For example, in a binary classification problem (class is 0 or 1):

p(class=0) = count(class=0) / (count(class=0)+count(class=1))

If you have an even number of classes (e.g. 2), it is a good idea to choose an odd value for K to avoid ties; conversely, use an even K when you have an odd number of classes.

Ties can be broken consistently by expanding K by 1 and looking at the class of the next most similar instance in the training dataset.
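
This voting procedure can be written in a few lines. Here is a minimal from-scratch sketch, assuming numeric features and Euclidean distance (for brevity, ties are left to Counter's ordering rather than expanding K):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Distance from the new instance to every training instance
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Labels of the k most similar instances
    nearest = y_train[np.argsort(dists)[:k]]
    votes = Counter(nearest)
    # Normalized class frequencies, e.g. p(class=0) = count(class=0) / k
    probs = {c: n / k for c, n in votes.items()}
    return votes.most_common(1)[0][0], probs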

Making Predictions with K-NN

A case can be classified by a majority vote of its neighbors. The case is then assigned to the most common class amongst its K nearest neighbors measured by a distance function. Suppose the value of K is 1, then the case is simply assigned to the class of its nearest neighbor.

The three most common distance measures are:

  • Euclidean distance: d(x, y) = sqrt(Σi (xi − yi)²)
  • Manhattan distance: d(x, y) = Σi |xi − yi|
  • Minkowski distance: d(x, y) = (Σi |xi − yi|^q)^(1/q)

The three distance measures mentioned above are valid only for continuous variables. For categorical variables, the Hamming distance is used. It also brings up the issue of standardization of the numerical variables between 0 and 1 when there is a mixture of numerical and categorical variables in the dataset.

Hamming distance: D_H = Σi |xi − yi|, where |xi − yi| = 0 if the two categorical values match and 1 otherwise; in other words, the number of attributes on which two instances differ.
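
For reference, all of these measures are available in scipy.spatial.distance; a small sketch with made-up vectors:

from scipy.spatial import distance

a, b = [3.0, 4.0], [0.0, 0.0]
print(distance.euclidean(a, b))        # 5.0
print(distance.cityblock(a, b))        # 7.0 (Manhattan)
print(distance.minkowski(a, b, p=3))   # Minkowski of order q = 3

# scipy's hamming returns the *fraction* of positions that differ
print(distance.hamming([0, 1, 1], [0, 0, 1]))   # 1/3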

By inspecting the data, you can choose a good value for K. Generally, a larger value of K is more accurate as it tends to reduce the overall noise, but this is not always true. Cross-validation is another way to retrospectively determine a good K value, by using an independent dataset to validate it. Historically, the optimal K for most datasets has been between 3 and 10, which produces better results than 1-NN.

For example, consider the data below, concerned with credit default. Age and Loan are two numerical variables (predictors) and Default is the target.


Using the data above as the training set, we can classify an unknown case (Age=48 and Loan=$142,000) using Euclidean distance. If K=1, the nearest neighbor is the last case in the training set, with Default=Y.


Age | Loan     | Default | Distance
25  | $40,000  | N       | 102000
35  | $60,000  | N       | 82000
45  | $80,000  | N       | 62000
20  | $20,000  | N       | 122000
35  | $120,000 | N       | 22000 (2nd nearest)
52  | $18,000  | N       | 124000
23  | $95,000  | Y       | 47000
40  | $62,000  | Y       | 80000
60  | $100,000 | Y       | 42000 (3rd nearest)
48  | $220,000 | Y       | 78000
33  | $150,000 | Y       | 8000 (1st nearest)
48  | $142,000 | ?       |

Euclidean distance between two cases: D = sqrt((x1 − x2)² + (y1 − y2)²)

With K=3, there are two Default=Y and one Default=N out of three closest neighbors. The prediction for the unknown case is again Default=Y.
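
This example is easy to verify in code. A minimal sketch with the table's values (the distances in the table above are rounded):

import numpy as np

data = np.array([
    [25,  40000], [35,  60000], [45,  80000], [20,  20000],
    [35, 120000], [52,  18000], [23,  95000], [40,  62000],
    [60, 100000], [48, 220000], [33, 150000],
])
default = np.array(list("NNNNNNYYYYY"))
new = np.array([48, 142000])

dists = np.sqrt(((data - new) ** 2).sum(axis=1))
nearest = np.argsort(dists)[:3]           # K = 3
print(default[nearest])                   # ['Y' 'N' 'Y'] -> majority vote: Y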

Standardized Distance

One major drawback of calculating distance measures directly from the training set arises when variables have different measurement scales, or when there is a mixture of numerical and categorical variables. For example, if one variable is annual income in dollars and another is age in years, income will have a much higher influence on the calculated distance. One solution is to standardize the training set, as shown below.

Age   | Loan | Default | Distance
0.125 | 0.11 | N       | 0.7652
0.375 | 0.21 | N       | 0.5200
0.625 | 0.31 | N       | 0.3160
0     | 0.01 | N       | 0.9245
0.375 | 0.50 | N       | 0.3428
0.8   | 0.00 | N       | 0.6220
0.075 | 0.38 | Y       | 0.6669
0.5   | 0.22 | Y       | 0.4437
1     | 0.41 | Y       | 0.3650
0.7   | 1.00 | Y       | 0.3861
0.325 | 0.65 | Y       | 0.3771
0.7   | 0.61 | ?       |

Standardized variable: Xs = (X − min) / (max − min)

Using the standardized distance on the same training set, the unknown case returns a different nearest neighbor, which is not a good sign of robustness.
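
Continuing the credit-default sketch above, a minimal min-max standardization of both predictors before computing distances:

# Min-max standardization, column-wise: Xs = (X - min) / (max - min)
mins, maxs = data.min(axis=0), data.max(axis=0)
data_s = (data - mins) / (maxs - mins)
new_s = (new - mins) / (maxs - mins)

dists_s = np.sqrt(((data_s - new_s) ** 2).sum(axis=1))
print(default[np.argsort(dists_s)[:3]])   # the nearest neighbor is now a Default=N case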

Between-sample geometric distance

The k-nearest-neighbor classifier is commonly based on the Euclidean distance between a test sample and the specified training samples. Let xi be an input sample with p features (xi1, xi2, …, xip), n be the total number of input samples (i = 1, 2, …, n) and p the total number of features (j = 1, 2, …, p). The Euclidean distance between sample xi and sample xl (l = 1, 2, …, n) is defined as:

d(xi, xl) = sqrt((xi1 − xl1)² + (xi2 − xl2)² + … + (xip − xlp)²)

A graphical representation of the nearest neighbor concept is illustrated in the Voronoi tessellation. The tessellation shows 19 samples marked with a "+", and the Voronoi cell, R, surrounding each sample. A Voronoi cell encapsulates all neighboring points that are nearest to each sample and is defined as:

Ri = {x ∈ R^p : d(x, xi) ≤ d(x, xm) for all i ≠ m}

Where Ri is the Voronoi cell for sample xi, and x represents all possible points within Voronoi cell Ri.

[Figure: Voronoi tessellation showing the Voronoi cells of 19 samples marked with a "+"]

The Voronoi tessellation reflects two characteristics of the example 2-dimensional coordinate system: i) all possible points within a sample's Voronoi cell are the nearest neighboring points for that sample, and ii) for any sample, the nearest sample is determined by the closest Voronoi cell edge.

According to the latter characteristic, the k-nearest-neighbor classification rule is to assign to a test sample the majority category label of its k nearest training samples. In practice, k is usually chosen to be odd, so as to avoid ties. The k = 1 rule is generally called the nearest-neighbor classification rule.

Curse of Dimensionality

The curse of dimensionality refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces (often with hundreds or thousands of dimensions). Such phenomena do not occur in low-dimensional settings such as the three-dimensional physical space of everyday experience.

K-NN works well with a small number of input variables (p), but struggles when the number of inputs is very large. Each input variable can be considered a dimension of a p-dimensional input space. For example, with two input variables x1 and x2, the input space is 2-dimensional. As the number of dimensions increases, the volume of the input space grows at an exponential rate.

In higher dimensions, points that are similar may still be separated by large distances; every point ends up far from every other point, and our intuition from 2- and 3-dimensional spaces no longer applies. This problem is called the “Curse of Dimensionality”.
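
The effect is easy to demonstrate. In the sketch below (uniform random points, chosen purely for illustration), the ratio between the nearest and farthest distances from a query point creeps towards 1 as the number of dimensions p grows, so the "nearest" neighbors become barely nearer than anything else:

import numpy as np

rng = np.random.default_rng(0)
for p in (2, 10, 100, 1000):
    X = rng.random((1000, p))                         # 1000 random points in p dimensions
    d = np.sqrt(((X[1:] - X[0]) ** 2).sum(axis=1))    # distances from the first point
    print(p, d.min() / d.max())                       # ratio approaches 1 as p grows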

How is K in K-means different from K in K-NN?

K-Means Clustering and the k-Nearest Neighbors algorithm are both commonly used in Machine Learning, and they are often confused with each other, especially when we talk about the k-factor. The ‘K’ in K-Means Clustering has nothing to do with the ‘K’ in the K-NN algorithm. k-Means Clustering is an unsupervised learning algorithm used for clustering, whereas K-NN is a supervised learning algorithm used for classification.

K-Means Algorithm

The k-means algorithm is an unsupervised clustering algorithm which takes a set of unlabeled points and groups them into “k” clusters.

The “k” in k-means denotes the number of clusters you would like to have in the end. Suppose the value of k is 5, it means you will have 5 clusters on the data set.

Let us see how it works.

Step 1: First you determine the value of K by Elbow method and then specify the number of clusters K

Step 2: Next you have to randomly assign each data point to a cluster

Step 3: Determine the cluster centroid coordinates

Step 4: Determine the distances of each data point to the centroids and re-assign each point to the closest cluster centroid based upon minimum distance

Step 5: Calculate cluster centroids again

Step 6: Repeat steps 4 and 5 until the cluster assignments stop changing, i.e. the algorithm has converged to a (local) optimum and no data point switches from one cluster to another
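
Before turning to scikit-learn below, here is a minimal from-scratch sketch of steps 2 to 6, assuming numpy, initializing the centroids from randomly chosen points (a common variant), and assuming no cluster goes empty:

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Steps 2-3: pick k random points as the initial centroids
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # Step 4: assign every point to its closest centroid
        labels = np.argmin(((X[:, None] - centroids) ** 2).sum(axis=2), axis=1)
        # Step 5: recompute each centroid as the mean of its cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):   # Step 6: stop once assignments settle
            break
        centroids = new_centroids
    return labels, centroids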

Implementation in Python

#Finding the optimum number of clusters for k-means clustering
from sklearn.cluster import KMeans
from sklearn import datasets
import pandas as pd
import matplotlib.pyplot as pl

# Assumes the Iris measurements are loaded into a DataFrame x
iris = datasets.load_iris()
x = pd.DataFrame(iris.data, columns=['sepal_length', 'sepal_width',
                                     'petal_length', 'petal_width'])

Nc = range(1, 10)
kmeans = [KMeans(n_clusters=i) for i in Nc]
# score() returns the negative inertia; the "elbow" is where it levels off
score = [kmeans[i].fit(x).score(x) for i in range(len(kmeans))]
pl.plot(Nc, score)
pl.xlabel('Number of Clusters')
pl.ylabel('Score')
pl.title('Elbow Curve')
pl.show()

[Figure: Elbow curve]

You can clearly see from the graph above why it is called 'the elbow method': the optimum number of clusters is where the elbow occurs.

Now that we have the optimum amount of clusters (k=3), we can move on to applying K-means clustering to the Iris dataset.

#Implementation of K-Means Clustering
import numpy as np
import matplotlib.pyplot as plt

model = KMeans(n_clusters=3)
model.fit(x)

# Colour each sample by its assigned cluster label
colormap = np.array(['Red', 'Blue', 'Green'])
z = plt.scatter(x.sepal_length, x.sepal_width, c=colormap[model.labels_])
plt.show()

[Figure: Iris samples coloured by cluster assignment]

#Accuracy of K-Means Clustering
from sklearn.metrics import accuracy_score

# Note: cluster labels are arbitrary; this comparison is only meaningful
# if the learned labels happen to line up with the target encoding
accuracy_score(iris.target, model.labels_)
# 0.8933333333333333
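
Since k-means labels are arbitrary integers, that accuracy figure only makes sense when they happen to line up with the targets. A safer sketch (reusing model, iris and accuracy_score from above) maps each cluster to its majority true class first:

import numpy as np

mapped = np.empty_like(model.labels_)
for j in range(3):
    mask = model.labels_ == j
    mapped[mask] = np.bincount(iris.target[mask]).argmax()   # majority class in cluster j
print(accuracy_score(iris.target, mapped))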

K-NN Algorithm

By now, we already know that K-NN is a supervised classification algorithm. It takes a set of labelled points and uses them to learn how to label other points. To label a new point, K-NN looks at the labelled points closest to it (its nearest neighbors) and has them vote: the majority label among those neighbors decides the label of the new point.

The “k” in K-Nearest Neighbors is the number of neighbors it checks. It is supervised because it is trying to classify a point on the basis of the known classification of other points.

Let us see how it works.

Step 1: Firstly, you determine the value for K.

Step 2: Then you calculate the distances between the new input (test data) and all the training data. The most commonly used metrics for calculating distance are Euclidean, Manhattan and Minkowski.

Step 3: Sort the distance and determine k nearest neighbors based on minimum distance values

Step 4: Analyze the category of those neighbors and assign the category for the test data based on majority vote

Step 5: Return the predicted class

Implementation using Python

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

error = []
# Calculating error for K values between 1 and 40
for i in range(1, 40):
    knn = KNeighborsClassifier(n_neighbors=i)   # note: "K-NN" is not a valid Python name
    knn.fit(X_train, y_train)
    pred_i = knn.predict(X_test)
    error.append(np.mean(pred_i != y_test))

plt.figure(figsize=(12, 6))
plt.plot(range(1, 40), error, color='black', linestyle='dashed', marker='o',
         markerfacecolor='grey', markersize=10)
plt.title('Error Rate K Value')
plt.xlabel('K Value')
plt.ylabel('Mean Error')
plt.show()

[Figure: Mean error rate for K values between 1 and 40]

Now we know for which values of ‘K’ the error rate is low. Let’s fix k=5 and implement the K-NN algorithm.

#Creating training and test splits
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

#Performing Feature Scaling
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

#Training K-NN with k=5
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)

#Evaluating on the test set
y_pred = classifier.predict(X_test)
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
[[10  0  0]
 [ 0  9  2]
 [ 0  1  8]]
                       precision    recall     f1-score     support
    Iris-setosa        1.00         1.00       1.00         10
Iris-versicolor        0.90         0.82       0.86         11
 Iris-virginica        0.80         0.89       0.84         9
       accuracy                                0.90         30
      macro avg        0.90         0.90       0.90         30
   weighted avg        0.90         0.90       0.90         30

Practical Applications of K-NN

Now that we have seen how K-NN works, let us look into some of its practical applications.

  • Recommending products to people with similar interests, recommending movies and TV shows as per viewer’s choice and interest, recommending hotels and other accommodation facilities while you are travelling based on your previous bookings.
  • Assigning credit ratings based on financial characteristics: by comparing an individual's financial features with people in a database who have similar profiles, a similar credit rating can be assigned.
  • Should the bank give a loan to an individual? Would an individual default on his or her loan? Is that person closer in characteristics to people who defaulted or did not default on their loans?
  • Some advanced examples could include handwriting detection (like OCR), image recognition and even video recognition.

Some pros and cons of K-NN

Pros

  • Training phase of K-nearest neighbor classification is faster in comparison with other classification algorithms.
  • Training of a model is not required for generalization.
  • Simple algorithm — to explain and understand/interpret.
  • High accuracy (relatively) — it is pretty high but not competitive in comparison to better supervised learning models.
  • K-NN can be useful in case of nonlinear data.
  • Versatile — useful for classification or regression.

Cons

  • Testing phase of K-nearest neighbor classification is slower and costlier with respect to time and memory. 
  • High memory requirement - Requires large memory for storing the entire training dataset.
  • K-NN requires scaling of data because K-NN uses the Euclidean distance between two data points to find nearest neighbors.
  • Euclidean distance is sensitive to magnitudes. The features with high magnitudes will weigh more than features with low magnitudes.
  • Not suitable for high-dimensional data.

How to improve the performance of K-NN?

  • Rescaling Data: K-NN performs much better if all of the data has the same scale. Normalizing your data to the range [0, 1] is a good idea. It may also be a good idea to standardize your data if it has a Gaussian distribution.
  • Addressing Missing Data: Missing data will mean that the distance between samples can not be calculated. These samples could be excluded or the missing values could be imputed.
  • Reducing Dimensionality: K-NN is suited to lower-dimensional data. You can try it on high-dimensional data (hundreds or thousands of input variables), but be aware that it may not perform as well as other techniques. K-NN can benefit from feature selection that reduces the dimensionality of the input feature space. The sketch below combines all three of these tips.
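
As a hedged sketch, all three tips can be combined in a single scikit-learn pipeline; the choice of k=5 features for SelectKBest is an arbitrary illustration and assumes the data has at least five features:

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsClassifier

pipe = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),   # fill in missing values
    ('rescale', MinMaxScaler()),                  # rescale features to [0, 1]
    ('select', SelectKBest(f_classif, k=5)),      # keep the 5 most informative features
    ('knn', KNeighborsClassifier(n_neighbors=5)),
])
# pipe.fit(X_train, y_train); pipe.predict(X_test)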

In this article we have learned about the K-Nearest Neighbor algorithm: where to use it, how it works, and so on. We have also discussed parametric and nonparametric machine learning algorithms, instance-based learning, eager and lazy learning, and the advantages and disadvantages of K-NN, along with suggestions for improving its performance, and we have implemented K-NN in Python. To learn more about other machine learning algorithms, join our Data Science Certification course and expand your learning skill set and career opportunities.

Priyankur Sarkar

Data Science Enthusiast

Priyankur Sarkar loves to play with data and get insightful results out of it, then turn those data insights and results in business growth. He is an electronics engineer with a versatile experience as an individual contributor and leading teams, and has actively worked towards building Machine Learning capabilities for organizations.

A Guide to Using AI Responsibly

"The artificial intelligence (AI) that we develop is impacting people directly or indirectly. And often, they don’t deserve the consequences of those impacts, nor have they asked for it. It is incumbent upon us to make sure that we are doing the right thing”.- Dr. Anthony Franklin, Senior Data Scientist and AI Engineer, MicrosoftDigitally addressing a live global audience in a recent webinar on the topic of ‘Responsible AI’, Dr. Anthony Franklin, a senior data science expert and AI evangelist from Microsoft, spoke about the challenges that society faces from the ever-evolving AI and how the inherent biased nature of humans is reflected in technology.Drawing from his experience in machine learning, risk analytics, analytics model management in government as well as data warehouse, Dr. Franklin shed light on the critical need to incorporate ethics in developing AI. Citing examples from various incidents that have taken place around the world, Dr. Franklin emphasized why it is critical for us to have an uncompromising approach towards using AI responsibly. He talked about the human (over)indulgence in technology, the challenges that society faces from the ever-evolving AI and how the inherent biased nature of humans is reflected through technology.The purpose of the talk and this article is to help frame the debate on responsible AI with a set of principles we can anchor on, and a set of actions we can all take to advance the promise of AI in ways that don’t cause harm to people. In this article, we present key insights from the webinar along with the video for you to follow along.KnowledgeHut webinar on Responsible AI by Dr. Anthony Franklin, MicrosoftWhat is the debate about?These are times when we can expect to see policemen on the streets wearing AI glasses, viewing, and profiling the public. Military organizations today, can keep an eye on the public. Besides, a simple exercise of googling the word CEO, would result in pages and pages showing white men.Police using AI glasses for public surveillance in ChinaThese are just some of the examples of the unparalleled success we have achieved in technology coupled with the fact that the same technology has overlooked the basic ethics, moral and social.Responsible AI is a critical global needIn a recent study conducted from among the top ten technologically advanced nations, nearly nine of ten organizations across countries have encountered ethical issues resulting from the use of AI.Source: Capgemini    Artificial intelligence has captured our imagination and made many things we would have thought impossible only a few years ago seem commonplace today. But AI has also raised some challenging issues for society writ large. We are in a race to advance AI capabilities and everything is about collecting data. But, what is being done with the data?Advancements in AI are different from other technologies because of the pace of innovation and its proximity to human intelligence – impacting us at a personal and societal level.While there remains no end to this ever-ending road of development, the need for us to ensure an equally powerful framework has increased even more. The need for a responsible AI is a critical global need.What developers are saying about ethics in AIStack Overflow carried out a couple of anonymous developer focused surveys in 2018. Some of the responses are a clear indication of how the machine is often so powerful. While we wish the answers were all "No", the actual answers are not too surprising.1. 
What would the developers do if asked to write a code for an unethical purpose?The majority (58.5 percent) stated they would clearly decline if they were to be approached to write code for an unethical purpose. Over a third (37 percent), however, said they would do if it met some specific criteria of theirs.2. Who is ultimately responsible for the code which accomplishes something unethical?When asked with whom the ultimate responsibility lies if their code were to be used to accomplish something unethical, nearly one fifth of the developers acknowledge that such a responsibility should lie with the developer who wrote the code. 23 percent of the developers stated that this accountability should lie with the person who came up with the idea. The majority (60 percent), however, felt that the senior management should be responsible for this.3. Do the developers have an obligation to consider the ethical implications?A significant majority (80 percent) acknowledged that developers have the obligation to consider ethical implications. Though in smaller numbers, the above studies show the ability of the developers to get involved in unethical activity and the tendency to brush off accountability. Thus, there is a great and growing need not just for developers, but also for the rest of us to work collectively to change these numbers.The six basic principles of AIThough ambiguous, the principles attached with the ethics of AI remain very much tangible. Following are the six basic principles of AI:1. FairnessFairness (noun)the state, condition, or quality of being fair, or free from bias or injustice; evenhandednessDiscriminationOne of the many services which Amazon provides today includes the same-day-shipping policy. The map below shows the reach of the policy in the top 6 metropolitans in the US.Source: Bloomberg   In the city of Boston, one can see the gaps, the places where the service is not provided. Coincidentally, these areas turned out to be areas inhabited by individuals belonging to the lower economic strata. In defence, the Amazon stated that the policy was meant primarily for regions with denser Amazon users. Whichever way this is seen, the approach still ends up being discriminatory.We see examples of bias in search as well. When we search for “CEO” in Bing, we see that all pictures are pictures of mostly white men, creating the impression that there are no women CEOs.RacismWe see examples of bias across different applications of AI. An image of an Asian American was submitted for the purpose of renewing the passport. After analysing the subject, the application’s statement read “Subjects eyes are closed”.This highlights the unintentional, but negatively impactful working of a data organization. It further goes on to show how an inherent bias held by humans, transcends into the technology we make.An algorithm widely used in US hospitals to allocate healthcare to patients has been systematically discriminating against black people, a sweeping analysis has found.The study, published in Science in October 2019, concluded that the algorithm was less likely to refer black people than white people who were equally sick, to programmes that aim to improve care for patients with complex medical needs. Hospitals and insurers use the algorithm and others like it to help manage care for about 200 million people in the United States each year.As a result, millions of black people have not been able to get equal medical treatment. 
2. Reliability and Safety
Reliability (noun): the ability to be relied on or depended on, as for accuracy, honesty, or achievement.
Safety (noun): the state of being safe; freedom from the occurrence or risk of injury, danger, or loss; the quality of averting or not causing injury, danger, or loss.

In the case of autonomous vehicles, how can we as consumers be 100 percent sure of our safety? Or can we ever be? How many miles does a car have to cover, or how many people have to lose their lives, before the rest of us are assured? These are just a few of the questions a company must answer before establishing itself as a reliable organization.

A project from scientists in the UK and India shows one possible use of automated surveillance technology: identifying violent behavior in crowds with the help of camera-equipped drones. In a paper titled "Eye in the Sky," the researchers used a simple Parrot AR quadcopter (which costs around $200) to transmit video footage over a mobile internet connection for real-time analysis. A figure from the paper shows how the software analyzes individuals' poses and matches them to "violent" postures. The question is: how will this technology be used, and who will use it?

Researchers working in this field often note there is a huge difference between staged tests and real-world use cases. Though this system is yet to prove itself, it is a clear illustration of the direction contemporary research is taking. Using AI to identify body poses is a common problem, with big tech companies like Facebook publishing significant research on the topic. Many experts agree that automated surveillance technologies are ripe for abuse by law enforcement and authoritarian governments.

3. Privacy and security
Privacy (noun): the state of being apart from other people or concealed from their view; solitude; seclusion; the state of being free from unwanted or undue intrusion or disturbance in one's private life or affairs; freedom to be let alone.
Security (noun): freedom from danger, risk, etc.; safety; freedom from care, anxiety, or doubt; well-founded confidence; something that secures or makes safe; protection; defense.

Strava's heat map revealed military bases around the world and exposed soldiers to real danger. This is not AI per se, but it is useful for a discussion about data.
A similar incident took place in Russia, too.

iRobot's latest Roomba i7+ robovac maps users' homes to let them customize the cleaning schedule. An integration with Google Assistant lets customers give verbal commands like, "OK Google, tell Roomba to clean the kitchen." This is a voluntary action and needs the user's consent.

Roomba's i7+ Robovac maps users' homes to let them customize the cleaning schedule

In October 2018, Google admitted it had exposed the personal data of around 500,000 Google+ users, leading to the closure of the platform. It also announced it was reviewing third-party access to Gmail after it was revealed that many developers were reading and analyzing users' personal mail for marketing and data mining.

A 2012 New York Times article spoke about a father who found himself in the uncomfortable position of having to apologize to a Target employee. Earlier, he had stormed into a store near Minneapolis and complained to the manager that his daughter was receiving coupons for cribs and baby clothes in the mail. It turned out that Target knew his teenage daughter better than he did: she was pregnant, and Target knew this before her dad did. By crawling the teen's purchase data, statisticians at Target were able to identify about 25 products that, when analysed together, allowed them to assign each shopper a "pregnancy prediction" score. More importantly, they could also estimate her due date to within a small window, so they could send coupons timed to very specific stages of her pregnancy.

In another instance, reported in Canada, a mall was found to be using facial recognition software in its directories in June to track shoppers' ages and genders without telling them.

4. Inclusiveness
Inclusive (adjective): including or encompassing the stated limit or extremes in consideration or account; including a great deal, or encompassing everything concerned; comprehensive.

In the K.W. v. Armstrong case, the plaintiffs were vulnerable adults living in Idaho, facing various psychological and developmental disabilities. They complained to the court when the Idaho Department of Health and Welfare reduced their medical assistance budgets by a whopping 42 percent. The department claimed that the reasons for the cuts were "trade secrets" and refused to disclose the algorithm it used to calculate the reductions.

K.W. v. Armstrong plaintiff, Christie Mathwig

Once a system is found to be discriminatory or otherwise inaccurate, there is an additional challenge in redesigning it. Ideally, government agencies should develop an inclusive redesign process that allows communities affected by algorithmic decision systems to participate meaningfully. But this approach is frequently met with resistance.

5. Transparency
Transparent (adjective): having the property of transmitting rays of light through its substance so that bodies situated beyond or behind can be distinctly seen; easily seen through, recognized, or detected.

A company in New Orleans assisted police officials in predicting which individuals were likely to commit crimes. This is an example of predictive analytics being used for policing strategies, carried out secretively.

In the Rich Caruana case study, data from 10 million patients, with thousands of features, was used to train a model to predict the risk of pneumonia and decide whether patients should be sent to hospital.
But was this model safe to deploy and use on real patients? Was the test data sufficient to make accurate predictions? Unfortunately, a number of different machine learning models had been used to train an accurate black box, without anyone knowing what was inside it. A multitask neural net was thought to be the most accurate, but was the approach safe?

The pattern the model found in the data was, strictly speaking, accurate: asthmatic patients appeared to be at lower risk of dying from pneumonia. The good news was the reason behind this pattern: asthmatics were rushed to the front of the line as soon as they arrived at the hospital, and that faster, more targeted treatment was so effective that it lowered their risk of dying below that of the general population. All the model learned from were the outcomes. The bad news was that if this model were used to decide whether to admit patients to hospital, it would send asthmatics home; it would be dangerous to them and hence not at all safe to use.

Not only is this an issue of safety, it is also a violation of transparency. The key problem is that there are bad patterns we don't know about. While a neural net is more accurate and can learn things fast, one doesn't know everything the neural net is using. We really need to understand a model before we deploy it.

Through a technique called Generalized Additive Models (GAMs), whereby the influence of individual attributes in the training data can be measured independently, a new model was trained whose outputs are completely transparent, and which actually improved performance over the old model.
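Interpretable GAM-style models are available off the shelf today, for example in Microsoft's open-source InterpretML library, which grew out of this line of research. Below is a minimal sketch, on synthetic data standing in for patient records, that trains an Explainable Boosting Classifier, a tree-based GAM whose per-feature contributions can be inspected one at a time; treat the data and settings as illustrative assumptions, not the study's actual setup.

```python
# pip install interpret scikit-learn
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from interpret.glassbox import ExplainableBoostingClassifier

# Synthetic stand-in for patient data; the real study used millions of
# records and thousands of features.
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An Explainable Boosting Machine is a GAM: the prediction is a sum of
# per-feature functions, so each attribute's influence is separable.
ebm = ExplainableBoostingClassifier(random_state=0)
ebm.fit(X_train, y_train)

# explain_global() exposes the learned shape function for every feature,
# the kind of view that let researchers spot the asthma anomaly.
explanation = ebm.explain_global()
print(accuracy_score(y_test, ebm.predict(X_test)))
```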
In another instance, one of the tools used by the New Orleans Police Department to identify members of gangs like 3NG and the 39ers came from the Silicon Valley company Palantir. The company provided software to a secretive NOPD program that traced people's ties to other gang members, outlined criminal histories, analyzed social media, and predicted the likelihood that individuals would commit violence or become victims. As part of the discovery process in one trial, the government turned over more than 60,000 pages of documents detailing the evidence gathered against the defendant from confidential informants, ballistics, and other sources, but made no mention of the NOPD's partnership with Palantir.

6. Accountability
Accountable (adjective): subject to the obligation to report, explain, or justify something; responsible; answerable; capable of being explained; explicable.

As in the example of autonomous vehicles: in the case of any mishap, where does accountability lie? Who is to be blamed for the loss of lives or any destruction caused by a driverless car? It appears that the more advanced the technology, the faster it loses accountability. Be it a driverless car crashing or a robot killing a person, the question remains: who is to blame? Whom does one sue after being hit by a driverless car? What if a medical robot gives a patient the wrong drug? What if a vacuum robot sucks up someone's hair while they are napping on the floor? Can a robot commit a war crime? Who gets to decide whether a person deserves certain treatment under an algorithm-based healthcare policy: the organization that developed it, or the developer who built it? There is a clear lack of accountability in such situations. On the liability of automated systems, the debate continues.

The key word in all the principles above is impact. The consequence of any AI programming, intentional or unintentional, leaves a strong impact.

The responsible AI lifecycle
Both the Association for Computing Machinery (ACM) and the Institute of Electrical and Electronics Engineers (IEEE) published ethics guidelines for computer scientists in the early 1990s. More recently, countless social scientists and STS researchers have been sounding the alarm about technology's potential to harm people and society.

To turn talk about responsible AI into action, organizations need to make sure their use of AI fulfils several criteria. After defining its basic AI principles, an organization can develop a prototype. But it must remain open to change even after launching what it assumes to be the most fool-proof AI service.

Microsoft's Responsible AI Lifecycle is built on six key stages, namely:
Define: define the objectives, data requirements, and responsible metrics.
Envision: consider the consequences and potential risks.
Prototype: build prototypes based on data, models, and experience, and test frequently.
Build: build and integrate AI according to responsible metrics and trade-offs.
Launch: launch only after diverse ring-testing, with an escalation and recovery plan.
Evolve: continuously analyse and improve.

Responsible AI Lifecycle. Source: Microsoft

Microsoft is leading the way with detailed guidelines to help teams put responsible AI into practice. Its Guidelines for Human-AI Interaction recommend best practices for how AI systems should behave upon initial interaction, during regular interaction, when they are inevitably wrong, and over time. They are meant to be used throughout the design process: as existing ideas are evaluated, as new ideas are brainstormed, and as collaboration is undertaken across multiple disciplines in creating AI. In addition, several other guidelines are given to engineering teams, including conversational AI guidelines, inclusive design guidelines, an AI fairness checklist, and AI security engineering guidance. All of these are designed to help teams anticipate and address potential security, risk, and ethics issues throughout the software development lifecycle.

Principles to practices
AI is already having an enormous and positive impact on healthcare, the environment, and a host of other societal needs. These rapid advances have given rise to an industry debate about how the world should (or shouldn't) use these new capabilities. As these systems become increasingly important to our lives, it is critical that when they fail we understand how and why, whether the failure stems from the inherent design of the system or from the actions of an adversary.

In conclusion, Dr. Franklin emphasized the need for enterprises to understand how bias can be introduced and affect recommendations. Attracting a diverse pool of AI talent across the organization, he stressed, is critical to developing analytical techniques that detect and eliminate bias.

We hope Dr. Franklin's webinar and this article have helped frame the debate on responsible AI, providing a set of principles we can anchor on and a set of actions we can take to advance the promise of AI in ways that don't cause harm to people.
How to Get the Best Out of Your Machine Learning Course

As a programmer, you understand well how a program works: it runs based on certain commands and statements written by you. At some point, however, some smart people asked whether it would be possible for a program to learn from past experience and improve its decision-making ability to enhance its overall performance. This is the most fundamental, simplified version of the idea of Machine Learning.

What is Machine Learning?
The term "Machine Learning" was coined by the American pioneer Arthur Samuel, who defined it as "the field of study that gives computers the ability to learn without being explicitly programmed".

In simple terms, Machine Learning is the science of getting things done using intelligent machines. It is a subset of Artificial Intelligence. It teaches a computer system to make precise predictions when some data is given as input. A Machine Learning model can make predictions by answering questions such as whether a piece of fruit in a picture is an orange or a mango, whether an email you received is spam or not, or what words are spoken in a YouTube video when generating captions.

A Machine Learning algorithm is fed data and information in the form of observations and real-world interactions. It studies the available data and improves its learning over time, in its own way, until it can make decisions and predictions. The applications of Machine Learning are widely used in several sectors, ranging from science and telecom to healthcare and production.

How to learn and grow in Machine Learning?
If you want to become an expert in Machine Learning, you need to follow several steps, investing a significant amount of time to learn the principles behind it and acquire a firm grasp of it. The most efficient way to learn Machine Learning is described below.

Understand the basics
Machine Learning is a deep domain, and before you get started with ML you should spend a couple of weeks grasping general, basic knowledge about the field. In this beginning phase, you should become well aware of detailed and correct answers to the following questions:
What is Machine Learning?
What is Machine Learning capable of?
What are the merits of learning Machine Learning?
What are the limitations of Machine Learning?
What are the applications of Machine Learning?

After you have gathered the fundamentals, you can head on to the related domains that are often associated with Machine Learning: Analytics, Data Science, Big Data, and Artificial Intelligence. If you want to become an expert, you need to understand the finer details of all the topics mentioned above. Try to understand the concepts in your own specific manner, so that you can explain them simply to just about anyone.

Recommended exercise
Write a blog post about "The Basics of Machine Learning" on any blogging website. Your article should answer questions about Machine Learning as though they were asked in an interview.

Learn Statistics
Data plays a very important role in the field of Machine Learning, and in your Machine Learning career you will spend most of your time working with data. This is where statistics comes into the picture. Statistics is a field of mathematics that deals with the collection and analysis of data, and explains how you can present your data effectively.
It is a prerequisite for understanding Machine Learning deeply. Though it is sometimes said that you can become a Machine Learning expert without expertise in statistics, you cannot completely avoid statistical concepts when the subject is Machine Learning and Data Science.

The concepts you need to learn in the domain of statistics are:
Significance of statistics
Data structures and variables
Basic principles of probability
Probability distributions
Hypothesis testing
Regression models

You can also gather information about the Bayesian model and its various concepts, which tend to be an essential part of Machine Learning.

Recommended exercise
As an exercise in statistics, create a list of references for each topic mentioned above, choosing those that explain it most simply, and publish the list as a blog post.

Learn Python or R
Truly mastering any programming language could well take an eternity. In your quest to become a Machine Learning expert, however, you need to get familiar with a language, and experts say this is not too difficult. There are numerous languages, such as Java, C, C++, Scala, Python, and R, in which you can implement Machine Learning algorithms. Python and R are the most popular, and learning one makes it easier to learn the other.

Most experts prefer Python, since it is easier to build Machine Learning models in Python than in any other programming language. While Python is best for writing Machine Learning code, when it comes to managing huge amounts of data for a Machine Learning project, experts often suggest R. Python also offers libraries specifically built for Machine Learning, like Keras, TensorFlow, and Scikit-learn. Learning both Python and R can therefore be an advantage on your journey to becoming a Machine Learning expert.

Learn Machine Learning concepts and algorithms
Now that we have covered the prerequisites, let us reach the heart of Machine Learning. Algorithms are an essential part of the world of programming. You need to learn about the algorithms particularly designed for Machine Learning and the applications of these algorithms in your projects. Machine Learning is a wide field of study, and algorithms are the bread and butter of your journey through it. Along with Machine Learning algorithms, you should also know about the types and building blocks of Machine Learning:
Supervised Learning
Unsupervised Learning
Semi-supervised Learning
Reinforcement Learning
Data Preprocessing
Ensemble Learning
Model Evaluation
Sampling & Splitting

Learn about all these concepts in detail: what they mean and why they are used in Machine Learning.

Create learning models
The most fundamental idea of any Machine Learning model is that the model is given a large amount of data as input, along with the corresponding outputs. Here, we will consider the two most common kinds of Machine Learning models: the unsupervised learning model and the supervised learning model.

Unsupervised learning is a Machine Learning technique where the model works on its own to discover information. It uses unlabeled data and finds the internal patterns in the data, learning more and more about the data itself. It can be used, for example, in a situation where you are given data about different countries as input and need to find the countries most similar to each other based on a particular factor like population or health.
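As a rough sketch of that countries example, here is a minimal unsupervised model in Python using scikit-learn's KMeans. The country names and the two indicator columns are made-up stand-ins for real population and health data:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical country indicators; a real project would load these
# from a source such as World Bank data.
data = pd.DataFrame({
    "country":         ["A", "B", "C", "D", "E", "F"],
    "population_mln":  [1400, 1350, 330, 83, 67, 38],
    "life_expectancy": [76,   69,   79,  81, 82, 82],
})

# Scale the features so population does not dominate the distance metric.
features = StandardScaler().fit_transform(
    data[["population_mln", "life_expectancy"]]
)

# Group the countries into 2 clusters of similar profiles.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
data["cluster"] = kmeans.fit_predict(features)
print(data[["country", "cluster"]])
```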
Some of the concepts you need to learn about unsupervised learning algorithms are:
What is clustering?
What are the types of clustering?
What are association rules?

A supervised learning algorithm is a Machine Learning algorithm that learns in the presence of a supervisor or teacher: the training dataset is well labeled, and the learning process continues until the required performance is obtained. It is useful in a situation where, for example, you need to identify whether someone is likely to acquire a disease based on factors like lifestyle and habits.

Some of the concepts you need to learn about supervised learning algorithms are:
What is regression?
What are classification trees?
What are support vector machines?

Recommended exercise
As an exercise on learning models, take a dataset and create models with the help of all the algorithms you have learned. Train and test each of the models to enhance their performance.

Participate in competitions
Data science competitions provide a platform to interact and compete in solving real-world problems, since much of a data scientist's training is theoretical and can lack experience with real-world data. Competitions are among the best places to learn and augment your Machine Learning skills, and they also act as an opportunity to push boundaries and promote creativity among the brightest minds. The experience you gather from these competitions will help you develop feasible solutions when working with big data.

Some of the most popular data competitions for practicing Machine Learning algorithms are listed below:
Kaggle
International Data Analysis Olympiad (IDAO)
Topcoder
DataHack and DSAT
MachineHack

Learn about deep learning models
Deep Learning is a subfield of Machine Learning which is more powerful and flexible, since its learning process models the world as a hierarchy of concepts, with each concept explained in terms of simpler ones. The popularity of Deep Learning stems from the fact that it is powered by huge amounts of data. Smartphone assistants like Google Assistant and Siri were created with the help of deep learning models, and such models have also helped global companies build self-driving cars. Machines in this era can perform the basic things a human can, such as see, listen, read, write, and even speak, thanks to deep learning models. They also greatly enhance the skill set of people working on Artificial Intelligence.

Some of the topics you can cover to gather detailed insights about deep learning models are:
What are neural networks?
What is Natural Language Processing (NLP)?
What is TensorFlow?
What is OpenCV?

Recommended exercise
Create a model that can identify a flower from a fruit.
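A minimal starting point for that exercise is sketched below with TensorFlow's Keras API, assuming (hypothetically) that you have collected images under a folder with one subdirectory per class, e.g. images/flower and images/fruit:

```python
import tensorflow as tf

# Assumed layout (hypothetical): images/flower/*.jpg, images/fruit/*.jpg
train_ds = tf.keras.utils.image_dataset_from_directory(
    "images", image_size=(128, 128), batch_size=32)

# A small convolutional network: two conv/pool stages, then a classifier.
model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 255, input_shape=(128, 128, 3)),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(2),  # one logit per class: flower, fruit
])

model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)

# A handful of epochs is enough to verify the pipeline end to end.
model.fit(train_ds, epochs=5)
```

A network this small will not be very accurate, but it exercises every step of the workflow: loading labeled images, defining a model, and training it.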
Learn about Big Data technologies
Big Data refers to the large volumes of structured and unstructured data that business giants analyze for insights to make better decisions. A massive amount of data is used in day-to-day applications, and managing such huge amounts is possible because of Big Data technology. Big Data uses analytical techniques like Machine Learning, statistics, and data mining to perform multiple operations on a single platform. It allows storing, processing, analyzing, and visualizing data with the help of different tools.

Big Data technologies give real meaning to the machine learning models that have been around for decades: the models now have access to a sufficient quantity of data to feed the Machine Learning algorithms, so they can produce outputs useful to organizations. They have found applications in different sectors, from banking and manufacturing to the tech industries.

Learn about the following Big Data concepts to enrich your knowledge of the technologies used:
What is Big Data and its ecosystem?
What is Hadoop?
What is Spark?

Recommended exercise
As an exercise, install a local version of Hadoop or Spark and upload data to run processes. Extract the results, study them, and find different ways to improve them.

Work on a Machine Learning project
Finally, working on a Machine Learning project is crucial, as it helps you demonstrate your knowledge and skills on the subject. As a beginner, start with a sample machine learning project like social media sentiment analysis using Facebook or Twitter data. Some of the topics you can cover under this section are:
How to collect, clean, and prepare data?
What is exploratory data analysis?
How to create and select a model?

The steps you need to follow while working on a Machine Learning project are:
Deciding what problem you want to solve.
Deciding the required parameters.
Choosing the correct training data.
Deciding on the right algorithms.
Writing the code.
Checking the results.

Advanced Machine Learning courses
The Internet has a plethora of sources and materials from which you can start learning Machine Learning. Some of the most popular courses on Machine Learning, along with certifications, are:
Stanford's Machine Learning Course
Harvard's Data Science Course
Machine Learning by fast.ai
Deep Learning Course by deeplearning.ai
edX Machine Learning Course

Get started with the foundations
Machine Learning is an expanding field, and a set of Machine Learning skills is an investment in the future. You can establish a firm foundation with the Machine Learning with Python course, where you will study machine learning techniques and algorithms, programming best practices, Python coding, and more. This foundations course is intended to help developers of all skill levels get started with machine learning. Machine Learning is an area where learning never stops, and if you plan your journey to becoming a Machine Learning expert in a well-rounded manner, you will indeed recognize the next steps to rapidly propel your learning curve.
Trending Specialization Courses in Data Science

Data scientists today are earning more than average IT employees. A study estimates a need for 190,000 data scientists in the US alone by 2021. In India, the Big Data analytics sector is estimated to grow eightfold, reaching $16 billion by 2025. With such growing demand for data scientists, the industry is developing a niche market of specialists within its fields.

Companies of all sizes, from large corporations to start-ups, are realizing the potential of data science and increasingly hiring data scientists. This means that most data scientists work in a team staffed with individuals of similar skills. While you cannot remain a domain expert in everything related to data, you can be the best at the specific skill or specialization you were hired for. Not only this: specialization within data science will also give you more skills on paper and in practice, compared to other prospects at your next interview.

Trending specialization courses in Data Science
One of the biggest myths about data science is that one needs a degree or Ph.D. in Data Science to get a good job. This is not always necessary; in reality, employers value job experience more than education. Even someone from a non-technical background can pursue a career in data science with basic knowledge of its tools, such as SAS/R, Python coding, SQL databases, and Hadoop, and a passion for data. Let's explore some of the trending specializations that companies are currently looking for when hiring data scientists.

Data Science with Python
Python, originally a general-purpose language, is open source and has become a common language for data science. It has dedicated libraries for data analysis and predictive modeling, making it a highly demanded data science tool. On a personal level, learning data science with Python can also help you produce web-based analytics products. (A short illustrative sketch follows at the end of this section.)

Data Science with R
A powerful language commonly used for data analysis and statistical computing, R is one of the best picks for beginners as it does not require any prior coding experience. It offers packages like SparkR, ggplot2, dplyr, tidyr, and readr, which have made data manipulation, visualization, and computation faster. Additionally, it has provisions for implementing machine learning algorithms.

Big Data analytics
Big Data is the most trending of the listed specializations and requires a certain level of experience. It examines large amounts of data and extracts hidden patterns, correlations, and several other insights. Companies the world over are using it to get instant inputs and business results. According to IDC, revenue from Big Data and business analytics solutions will reach a whopping $189.1 billion this year. Big Data is a huge umbrella term covering several types of technologies used to get the most value out of the data collected, including machine learning, natural language processing, predictive analysis, text mining, SAS®, Hadoop, and many more.

Other specializations
Some knowledge of other fields is also required for data scientists to showcase their expertise in the industry. Staying in the know about tools and technologies related to machine learning, artificial intelligence, the Internet of Things (IoT), blockchain, and several other still-unexplored fields is vital for data enthusiasts who want to emerge as leaders in their niche.
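As promised above, here is a minimal sketch of the kind of everyday analysis the Python data stack makes easy; the table and its sales figures are invented purely for illustration:

```python
import pandas as pd

# Invented monthly sales data for illustration.
sales = pd.DataFrame({
    "month":   ["Jan", "Feb", "Mar", "Jan", "Feb", "Mar"],
    "region":  ["North", "North", "North", "South", "South", "South"],
    "revenue": [120, 135, 150, 90, 95, 110],
})

# Summarize revenue by region: total and average.
summary = sales.groupby("region")["revenue"].agg(["sum", "mean"])
print(summary)

# Month-over-month growth per region, a typical quick insight.
sales["growth"] = sales.groupby("region")["revenue"].pct_change()
print(sales)
```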
Building a career in Data Science
Whether you are a data aspirant from a non-technical background, a fresher, or an experienced data scientist, staying industry-relevant is important to get ahead. The industry is growing at a massive rate and is expected to have 2.7 million open job roles by the end of 2020. Industry experts point out that one of the biggest causes of layoffs at tech companies is not automation, but the growing gap between evolving technologies and the lack of niche manpower to work on them. To meet these high standards, keeping up with your data game is crucial.