K-Nearest Neighbors (KNN) is a supervised learning algorithm that applies to both regression and classification problems and is very easy to implement. The algorithm computes the distance from a sample to the points in the training data, selects the K nearest neighbors, and assigns the sample the class that occurs most often among them. Two distance measures are commonly used.
Suppose there are two points in the plane, $(x_1, y_1)$ and $(x_2, y_2)$.
The Euclidean distance is $d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$.
And the Manhattan distance is $d = |x_2 - x_1| + |y_2 - y_1|$.
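As a small illustration, the two distances between a pair of 2-D points could be computed as follows. The function names and the sample points are chosen here for the example and are not part of any library.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between 2-D points p = (x1, y1) and q = (x2, y2)."""
    return math.sqrt((q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)

def manhattan_distance(p, q):
    """Sum of the absolute coordinate differences between p and q."""
    return abs(q[0] - p[0]) + abs(q[1] - p[1])

# Illustrative points, not taken from the article.
a, b = (1, 2), (4, 6)
print(euclidean_distance(a, b))  # 5.0
print(manhattan_distance(a, b))  # 7
```

The Manhattan distance is never smaller than the Euclidean distance for the same pair of points, so the two measures can rank neighbors differently.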
KNN classifies a sample based on the classes of the samples that lie closest to it. Here, K is the number of nearest neighbors considered.
In the example shown in the picture above, if K=1, we look only at the single nearest neighbor, which is a circle, so our sample is classified as a circle. If we choose K=3 instead, two of the neighbors are triangles and one is a circle. Since triangles are in the majority, the sample is classified as a triangle. From this we can see that different values of K can lead to different outcomes. The best way to choose K is therefore to perform cross-validation and pick the value that minimizes the error.
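The voting behaviour described here can be sketched in a few lines of plain Python. The training points below are made up to mimic the circle/triangle situation and are not the data from the picture. The helper names knn_classify and loo_error are hypothetical, and leave-one-out is just one simple form of cross-validation, assumed here for brevity.

```python
import math
from collections import Counter

def knn_classify(train, query, k):
    """Predict the label of `query` by majority vote among its k nearest points.

    `train` is a list of ((x, y), label) pairs; distance is Euclidean.
    """
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    # On a tie, Counter returns the label that was counted first.
    return votes.most_common(1)[0][0]

def loo_error(data, k):
    """Leave-one-out error rate: classify each point using all the others."""
    errors = 0
    for i, (point, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        if knn_classify(rest, point, k) != label:
            errors += 1
    return errors / len(data)

# Made-up training data for illustration only.
train = [
    ((1.0, 1.0), "circle"),
    ((1.5, 2.0), "triangle"),
    ((2.0, 1.5), "triangle"),
    ((5.0, 5.0), "circle"),
    ((6.0, 5.0), "circle"),
]
query = (1.2, 1.2)

print(knn_classify(train, query, k=1))  # circle
print(knn_classify(train, query, k=3))  # triangle

# Pick the candidate K with the lowest leave-one-out error.
best_k = min([1, 3], key=lambda k: loo_error(train, k))
print(best_k)
```

With only five points the error estimate is very coarse. On real data one would normally try a wider range of K values and use more data per fold.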
For regression, we follow a similar approach, except that the prediction is the arithmetic mean of the target values of the K nearest neighbors.
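A minimal sketch of the regression variant, assuming Euclidean distance and an unweighted mean. The function name knn_regress and the one-feature data set are invented for the example.

```python
import math

def knn_regress(train, query, k):
    """Predict a numeric target as the mean target of the k nearest points.

    `train` is a list of (features, target) pairs, where `features` is a tuple.
    """
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    # Divide by the number of neighbors actually found, in case k > len(train).
    return sum(target for _, target in neighbors) / len(neighbors)

# Made-up data: one feature, one numeric target.
train = [((1.0,), 10.0), ((2.0,), 20.0), ((3.0,), 30.0), ((10.0,), 100.0)]

print(knn_regress(train, (2.2,), k=3))  # 20.0
```

The distant point at 10.0 does not influence the prediction here because it is not among the three nearest neighbors.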
One thing to remember is that KNN is sensitive to outliers and imbalanced datasets. Hence, during data cleansing, we have to identify and remove any outliers in the data. Otherwise, their presence could cause samples to be misclassified.
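One common way to flag outliers in a single numeric feature is the interquartile range (IQR) rule. The article does not say which method to use, so the rule itself, the 1.5 multiplier, and the function name remove_outliers_iqr are assumptions made for illustration.

```python
import statistics

def remove_outliers_iqr(values, factor=1.5):
    """Keep only values within [Q1 - factor*IQR, Q3 + factor*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - factor * iqr, q3 + factor * iqr
    return [v for v in values if low <= v <= high]

# Made-up feature values with one obvious outlier.
feature = [10, 12, 11, 13, 12, 11, 95]
print(remove_outliers_iqr(feature))
```

Whether a flagged point should really be dropped depends on the data. This sketch only shows the mechanical filtering step.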
KNN is non-parametric, so it adapts easily to new data, and its hyperparameter tuning is straightforward. On the other hand, it needs more data to make good predictions, and because it keeps the whole training set, fitting can take up a lot of memory and prediction can be very slow on large datasets. It is also not a preferable choice for datasets with categorical features.