K-Nearest Neighbor (KNN) is a supervised machine learning algorithm and can be used for both classification and regression problems but is mainly used for classification problems. It is a non-parametric algorithm, i.e. it does not make any assumption on underlying data. KNN is a lazy learning algorithm which means it simply stores the data and generalizing beyond these data is not done until an explicit request is made. It is also referred to as an instance-based learning method as it heavily relies on memory by storing all its training data.
Classification: In classification, the KNN model predicts the target class label based on which class is the most frequently occurring class near the target class label i.e. class label is assigned on the majority vote. The term majority vote is most commonly used in literature while technically it is considered as plurality voting. The term majority voting typically refers to a reference value greater than 50% for making a decision which is applicable for binary class classification(two classes) but in the case of multiple classes, it’s not required a majority to make a prediction via KNN. For example, if there are three classes, the frequency greater than 33.33% is enough to assign a class label.
However before classification is done, the distance must be defined and for that various distance matrices are present but the most common is Euclidean distance.
Regression: The concept of KNN for regression is similar to classification, first we find the k nearest neighbours in the dataset and then we make predictions based on the labels of the k nearest neighbours. However, in regression, the target is a continuous value and we have to take an average of the k nearest neighbours instead of plurality voting.
Compute KNN: The main objective of the k-nearest neighbour algorithm is to identify the nearest neighbour of the target data point and assign a class label to that point. In order to identify which data points are closest to the target data point, distance should be calculated. There are several types of distance matrices but we are going to discuss only the following:
- Euclidean Distance
- Manhattan Distance
- Minkowski Distance
- Hamming Distance
- Cosine Distance
Euclidean Distance: This is the most commonly used distance metric used for calculating the distance between two consecutive points.
d is the length of the shortest line from X1 to X2. Euclidean distance between two vectors or points is the L2 norm of two vectors.
import numpy as np
X = np.array((1,2,3))
Y = np.array((1,1,4))
euc_distance = np.linalg.norm(X-Y)
Manhattan Distance: It is also known as taxicab distance or city block distance.
In two dimensions, the Manhattan distance between two points (x1,y1) and (x2,y2) is |x1-x2|+|y1-y2|. It is the sum of absolute values of the difference between both sets of coordinates. Manhattan distance is the L1 norm between two vectors.
Minkowski Distance: It is a distance measured between two points in N-dimensional space. It is a generalization of Euclidean distance and Manhattan distance. It is Lp norm of two vectors.
when p = 1 we get Manhattan distance and when p = 2 we get Euclidean distance.
Hamming Distance: It is mostly used in text processing. It has values in the form of boolean vectors and identifies the points where vector values don’t match. Suppose we have two points X1 and X2 and their boolean representation is as follows:
Hamming distance for the above vector is 4(where X1 and X2 values are different). Let’s consider two strings “ABCDE” and “AFGHE” of the same length and we have to find the Hamming distance between these two strings. We can compare alphabet by alphabet in both strings and we can see that only the first and last are similar, ABCDE and AFGHE. So the Hamming distance between these two strings is 3.
Cosine Distance: It is used to calculate the similarity between two vectors. It is mostly used to measure document similarity in text analysis. If θ = 0° then both A and B vectors will overlap each other which means both are similar while if θ = 90° then A and B vectors are dissimilar. If similarity decreases then the distance between two vectors will increase and vice versa.
How to decide the value of k in KNN?
There is no such particular way to decide the value of k, we have to try some values to find the best out of it. The most preferred value for k is 5. Lower values of k can have high variance, but low bias and larger values of k may lead to high bias and lower variance. It is recommended to use the odd value of k to avoid any tie situation in classification. The value of k also depends upon the dataset, data having more outliers and noise performs better with a higher value of k.
Advantages of KNN
- Simple and easy to implement
- The value of k and distance metrics are the only hyperparameters required for KNN.
- As new training samples are added, the algorithm adjusts to account for any new data since it doesn’t require any explicit training and all training data is stored in memory.
Disadvantages of KNN
- Since KNN is a lazy algorithm, it takes up more memory and data storage compared to other classifiers.
- Sensitive to outliers
- Prone to overfitting
Applications of KNN
- Text Mining
- Recommendation System
- Facial Recognition
Practical Implementation of KNN
# Import necessary libraries
import numpy as np
import pandas as pd
# Load the dataset
df = pd.read_csv('diabetes.csv')
# Declaration of feature vector and target column
X = df.drop('Outcome',axis=1).values
y = df['Outcome'].values# split X and Y into training and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 21,stratify=y)from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
To summarize, in this article we learned about the KNN algorithm, various types of distance metrics used in the KNN algorithm, its advantages and disadvantages and its applications. For complete practical implementation of KNN Classification visit my Github repository.
About the Author: I am Priya, currently working as a Data Scientist and possess knowledge of Exploratory Data Analysis, Machine Learning and Natural Language Processing. If you have any questions please connect with me on my Linkedin profile.