
Distance-based Algorithm: KNN

by Priya | September 7, 2022 | Machine Learning


K-Nearest Neighbor (KNN) is a supervised machine learning algorithm that can be used for both classification and regression problems, though it is mainly used for classification. It is a non-parametric algorithm, i.e. it makes no assumptions about the underlying data. KNN is a lazy learning algorithm, meaning it simply stores the data and defers any generalization until a prediction is explicitly requested. It is also referred to as an instance-based learning method, as it relies heavily on memory by storing all of its training data.

K-Nearest Neighbor

Classification: In classification, the KNN model predicts the target class label as the class occurring most frequently among the k nearest neighbors, i.e. the class label is assigned by majority vote. The term majority vote is the one most commonly used in the literature, although technically this is plurality voting: a majority vote requires more than 50% of the votes, which only applies cleanly to binary classification (two classes). With multiple classes, a plurality is enough to make a prediction via KNN. For example, with three classes, a frequency greater than 33.33% can be enough to assign a class label.
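As a minimal sketch, with purely illustrative labels (not from any dataset in this article), the plurality vote over the k nearest labels can be written in a few lines of Python:

# Plurality vote: the most common label among the k nearest neighbors wins.
# The labels below are purely illustrative.
from collections import Counter

neighbor_labels = ["cat", "dog", "cat", "bird", "dog", "cat", "bird"]  # k = 7
predicted = Counter(neighbor_labels).most_common(1)[0][0]
print(predicted)  # "cat" wins with 3/7 of the votes (about 43%): a plurality, not a majority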

Before classification can be done, however, a distance must be defined. Various distance metrics exist for this, but the most common is Euclidean distance.

Regression: The concept of KNN for regression is similar to classification: first we find the k nearest neighbours in the dataset, then we make a prediction based on their labels. However, in regression the target is a continuous value, so we take the average of the k nearest neighbours' targets instead of a plurality vote.
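A minimal sketch of this averaging step, with purely illustrative neighbor targets:

# KNN regression: average the continuous targets of the k nearest neighbors.
# The target values below are purely illustrative.
import numpy as np

neighbor_targets = np.array([210.0, 195.5, 202.3, 188.0, 205.1])  # k = 5
prediction = neighbor_targets.mean()
print(prediction)  # 200.18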

Compute KNN: The main objective of the k-nearest neighbour algorithm is to identify the nearest neighbours of a target data point and assign a class label to that point. To identify which data points are closest to the target point, a distance must be calculated. There are several distance metrics, but we are going to discuss only the following:

  1. Euclidean Distance
  2. Manhattan Distance
  3. Minkowski Distance
  4. Hamming Distance
  5. Cosine Distance

Euclidean Distance: This is the most commonly used distance metric; it measures the straight-line distance between two points.

d is the length of the shortest line from X1 to X2; the Euclidean distance between two points (vectors) is the L2 norm of their difference:

d(X, Y) = √( Σᵢ (xᵢ − yᵢ)² )
import numpy as np

# Two example points in 3-dimensional space
X = np.array((1, 2, 3))
Y = np.array((1, 1, 4))

# Euclidean distance is the L2 norm of the difference vector
euc_distance = np.linalg.norm(X - Y)
print(euc_distance)  # 1.4142...

Manhattan Distance: It is also known as taxicab distance or city block distance.

In two dimensions, the Manhattan distance between two points (x1, y1) and (x2, y2) is |x1 − x2| + |y1 − y2|, the sum of the absolute differences of their coordinates. In general, the Manhattan distance is the L1 norm of the difference between two vectors:

d(X, Y) = Σᵢ |xᵢ − yᵢ|

Minkowski Distance: It is a distance measured between two points in N-dimensional space, and it generalizes both the Euclidean and the Manhattan distance. It is the Lp norm of the difference between two vectors:

d(X, Y) = ( Σᵢ |xᵢ − yᵢ|ᵖ )^(1/p)

When p = 1 we recover the Manhattan distance, and when p = 2 we recover the Euclidean distance, as the sketch below demonstrates.
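A minimal sketch, reusing the example points from the Euclidean snippet above:

# Minkowski distance is the Lp norm of X - Y; p = 1 and p = 2 recover
# the Manhattan and Euclidean distances respectively.
import numpy as np

X = np.array((1, 2, 3))
Y = np.array((1, 1, 4))

manhattan = np.linalg.norm(X - Y, ord=1)              # p = 1: |0| + |1| + |-1| = 2.0
euclidean = np.linalg.norm(X - Y, ord=2)              # p = 2: sqrt(0 + 1 + 1) = 1.4142...
minkowski_p3 = np.sum(np.abs(X - Y) ** 3) ** (1 / 3)  # general form with p = 3
print(manhattan, euclidean, minkowski_p3)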

Hamming Distance: It is mostly used in text processing. It operates on boolean vectors (or other equal-length sequences) and counts the positions at which the values do not match. Suppose two boolean vectors X1 and X2 differ in four positions; the Hamming distance between them is then 4. Likewise, consider two strings of equal length, "ABCDE" and "AFGHE". Comparing them character by character, only the first and last characters match (ABCDE vs AFGHE), so the Hamming distance between these two strings is 3.
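A minimal sketch of a Hamming distance helper (the function hamming is our own, not a library call), reproducing both examples above:

# Hamming distance: count the positions where two equal-length sequences differ.
def hamming(a, b):
    assert len(a) == len(b), "Hamming distance requires equal-length inputs"
    return sum(x != y for x, y in zip(a, b))

print(hamming("ABCDE", "AFGHE"))            # 3, as in the string example above
print(hamming([1, 0, 1, 1], [0, 1, 0, 0]))  # 4: the vectors differ in all four positions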

Cosine Distance: It is derived from the cosine similarity between two vectors and is mostly used to measure document similarity in text analysis. If θ = 0°, the vectors A and B point in the same direction and are maximally similar; if θ = 90°, they are orthogonal and dissimilar. As similarity decreases, the distance between the two vectors increases, and vice versa:

cos θ = (A · B) / (‖A‖ ‖B‖), and cosine distance = 1 − cos θ
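A minimal sketch, with purely illustrative vectors, computing cosine similarity and the corresponding cosine distance:

# Cosine similarity is the dot product of A and B divided by the product of
# their L2 norms; cosine distance is 1 minus that similarity.
import numpy as np

A = np.array([1.0, 2.0, 3.0])
B = np.array([2.0, 4.0, 6.0])  # same direction as A, so theta = 0

cos_sim = A @ B / (np.linalg.norm(A) * np.linalg.norm(B))
cos_dist = 1 - cos_sim
print(cos_sim, cos_dist)  # 1.0 and 0.0: identical direction, zero distance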

How to decide the value of k in KNN?

There is no single rule for choosing the value of k; we have to try several values and pick the best one. A common default is k = 5. Lower values of k give low bias but high variance, while larger values of k give lower variance but higher bias. It is recommended to use an odd value of k to avoid ties in binary classification. The best value of k also depends on the dataset: data with more outliers and noise usually performs better with a larger value of k. A simple search over k is sketched below.
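As one possible sketch of this search, assuming X and y are the feature matrix and target from the diabetes dataset used in the implementation below, we can score odd values of k with 5-fold cross-validation:

# Try a range of odd k values and keep the one with the best cross-validated
# accuracy. Assumes X and y have already been loaded as in the code below.
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

scores = {}
for k in range(1, 22, 2):  # odd values of k to avoid ties
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])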

Advantages of KNN

  • Simple and easy to implement
  • The value of k and the distance metric are the only hyperparameters KNN requires.
  • As new training samples are added, the algorithm adapts to the new data immediately, since it does no explicit training and simply keeps all training data in memory.

Disadvantages of KNN

  • Since KNN is a lazy algorithm, it requires more memory and data storage than most other classifiers.
  • Sensitive to outliers
  • Prone to overfitting, especially for small values of k

Applications of KNN

  1. Text Mining
  2. Recommendation System
  3. Facial Recognition

Practical Implementation of KNN

# Import the necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load the dataset and inspect it
df = pd.read_csv('diabetes.csv')
df.head()
df.info()

# Declare the feature matrix and the target column
X = df.drop('Outcome', axis=1).values
y = df['Outcome'].values

# Split X and y into training and test sets, stratified on the target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=21, stratify=y)

# Fit a KNN classifier with k = 5 and report test-set accuracy
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))

Summary

To summarize, in this article we learned about the KNN algorithm, the various distance metrics it can use, its advantages and disadvantages, and its applications. For the complete practical implementation of KNN classification, visit my Github repository.

About the Author: I am Priya, currently working as a Data Scientist, with experience in Exploratory Data Analysis, Machine Learning and Natural Language Processing. If you have any questions, please connect with me on my Linkedin profile.


