
*Deep dive in the modelling assumptions and their implications*

While digging in the details of classical classification methods, I found sparse information about the similarities and differences of Gaussian Naive Bayes (GNB), Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA). This post centralises the information I found for the next learner.

Summary: all three methods are specific instances of the Bayes classifier, all deal with continuous Gaussian predictors, and they differ in the assumptions they make about the relationships amongst predictors and across classes (i.e. the way they specify the covariance matrices).

## The Bayes Classifier

We have a set X of **p** predictors, and a discrete response variable Y (the class) taking values k = {1, …, K}, for a sample of **n** observations.

We encounter a new observation for which we know the values of the predictors X, but not the class Y, so we would like to make a guess about Y based on the information we have (our sample).

The Bayes classifier assigns the test observation to the class with the highest conditional probability, given by:

$$P(Y = k \mid X = x) = \frac{\pi_k \, f_k(x)}{\sum_{l=1}^{K} \pi_l \, f_l(x)}$$

where pi_k is the prior probability of class k and f_k(x) is the likelihood of x within class k. To obtain probabilities for class k, we need to define formulae for the prior and the likelihood.

**The prior.** The probability of observing class k, i.e. that our test observation belongs to class k, without further information on the predictors. Looking at our sample, we can think of the number of cases in class k as a realisation from a random variable with Binomial distribution over **n** trials, where in each trial the observation either belongs (success) or does not belong (failure) to class k. It can be shown that the relative frequency of successes (the number of successes over the total number of trials),

$$\hat{\pi}_k = \frac{n_k}{n},$$

is an unbiased estimator for pi_k. Hence, we use the relative frequency as our prior for the probability that an observation belongs to class k.
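As a quick sketch (the label vector here is my own synthetic example), the prior estimate is just each class's relative frequency in the sample:

```python
import numpy as np

# Hypothetical sample of class labels for K = 2 classes
y = np.array([0, 0, 1, 0, 1, 1, 0, 0])

# Relative frequency of each class: an unbiased estimator of pi_k
classes, counts = np.unique(y, return_counts=True)
priors = counts / y.size
print(dict(zip(classes.tolist(), priors.tolist())))  # {0: 0.625, 1: 0.375}
```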

**The likelihood.** The likelihood is the probability of seeing these values for X, given that the observation actually belongs to class k. Hence, we need to find the distribution of the predictors X in class k. We don’t know what the “true” distribution is, so we can’t “find” it; rather, we make some reasonable assumptions about what it might look like, and then use our sample to estimate its parameters.

How do we choose a reasonable distribution? One clear division arises between discrete and continuous predictors. All three methods assume that within each class,

predictors have a Gaussian distribution (p=1) or a multivariate Gaussian distribution (p>1).

Hence, these algorithms can be used only when we have continuous predictors. In fact, Gaussian Naive Bayes is a specific case of general Naive Bayes with a Gaussian likelihood, which is why I’m comparing it with LDA and QDA in this post.

From now on, we’ll consider the simplest case able to showcase the differences between the three methods: two predictors (p=2) and two classes (K=2).
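To make the pieces concrete, here is a minimal sketch of the Bayes classifier for this p=2, K=2 setting (all priors, means, and covariances below are my own invented values, not from the text):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical priors and within-class Gaussian parameters (p=2, K=2)
priors = np.array([0.5, 0.5])
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), np.eye(2)]

def posterior(x):
    """P(Y=k | X=x) = pi_k * f_k(x) / sum_l pi_l * f_l(x)."""
    likelihoods = np.array(
        [multivariate_normal(mean=m, cov=c).pdf(x) for m, c in zip(means, covs)]
    )
    unnorm = priors * likelihoods
    return unnorm / unnorm.sum()

x_new = np.array([0.5, 0.2])
probs = posterior(x_new)
print(probs.argmax())  # class 0: x_new lies close to that class's mean
```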

## Linear Discriminant Analysis

LDA assumes that the covariance matrix across classes is the same.

This means that predictors in class 1 and class 2 might have different means, but their variances and covariances are the same: the “spread” of, and relationship between, the predictors is the same across classes.

The plot above was generated from distributions for each class of the form:

$$X \mid Y = k \sim \mathcal{N}(\mu_k, \Sigma), \quad k = 1, 2,$$

where we observe that the covariance matrix Σ is the same for both classes. This assumption is reasonable if we expect the relationship between predictors not to change across classes, and if we simply observe a shift in the means of the distributions.
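As a sketch (synthetic data; the shared covariance matrix and the class means are my own choice), we can generate two such classes and fit scikit-learn’s LDA:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
shared_cov = np.array([[2.0, 1.0], [1.0, 2.0]])  # same Sigma for both classes
X = np.vstack([
    rng.multivariate_normal([0, 0], shared_cov, size=100),  # class 0
    rng.multivariate_normal([4, 4], shared_cov, size=100),  # class 1
])
y = np.repeat([0, 1], 100)

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)
print(lda.predict([[0.2, -0.1], [4.1, 3.8]]))  # points near each class mean
```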

## Quadratic Discriminant Analysis

If we relax the constant covariance matrix assumption of LDA, we have QDA.

QDA does not assume constant covariance matrix across classes.

The plot above was generated from distributions for each class of the form:

$$X \mid Y = k \sim \mathcal{N}(\mu_k, \Sigma_k), \quad k = 1, 2,$$

where we observe that the two distributions are allowed to vary in all their parameters. This is a reasonable assumption if we expect the behaviour of, and relationships amongst, predictors to be very different in different classes.

In this example, even the direction of the relationship between the two predictors varies from class 1 to class 2, from a positive covariance of 4, to a negative covariance of -3.
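Sketching this (the off-diagonal covariances 4 and -3 come from the example above; the variances and means are my own choice), QDA fits a separate covariance matrix for each class:

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

rng = np.random.default_rng(0)
cov1 = np.array([[5.0, 4.0], [4.0, 5.0]])    # positive covariance of 4
cov2 = np.array([[5.0, -3.0], [-3.0, 5.0]])  # negative covariance of -3
X = np.vstack([
    rng.multivariate_normal([0, 0], cov1, size=200),
    rng.multivariate_normal([6, 6], cov2, size=200),
])
y = np.repeat([0, 1], 200)

qda = QuadraticDiscriminantAnalysis(store_covariance=True)
qda.fit(X, y)
# Each class gets its own estimated covariance matrix
print(np.round(qda.covariance_[0], 1))
print(np.round(qda.covariance_[1], 1))
```

The estimated off-diagonal terms recover the opposite-signed relationships between the two predictors in the two classes.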

## Gaussian Naive Bayes

GNB is a specific case of Naive Bayes, where the predictors are continuous and normally distributed within each class k. The general Naive Bayes (and hence GNB too) assumes:

Given Y, the predictors X are conditionally independent.

Independence implies zero correlation, i.e. the covariance between any pair of predictors is zero, so within each class the covariance matrix is diagonal.

The plot above was generated from distributions for each class of the form:

$$X \mid Y = k \sim \mathcal{N}(\mu_k, \Sigma_k), \quad \Sigma_k = \begin{pmatrix} \sigma_{1,k}^2 & 0 \\ 0 & \sigma_{2,k}^2 \end{pmatrix}$$

With Naive Bayes we are assuming there is no relationship between the predictors. In real problems this is rarely the case; nevertheless, it considerably simplifies the problem, as we will see in the next section.

## Implications of Assumptions

After selecting a model, we estimate the parameters of the within-class distributions to determine the likelihood of our test observation, and obtain the final conditional probability we use to classify it.

The different models result in a different number of parameters being estimated. Reminder: we have **p** predictors and **K** total classes. For all models we need to estimate the means of the Gaussian distributions of the predictors, which can differ in each class. This results in a base of **K*p** parameters to be estimated for all methods.

Additionally, if we pick LDA we estimate the variances for all p predictors and the covariances for each pair of predictors, resulting in

**p(p+1)/2**

parameters. These are constant across classes.

For QDA, since these parameters differ in each class, we multiply the number for LDA by K, resulting in **K*p(p+1)/2** estimated covariance parameters.

For GNB, we only have the variances for all predictors in each class: **p*K**.

It is easy to see the advantage of using GNB for large values of p and/or K. For the frequently occurring problem of binary classification, i.e. when K=2, this is how the model complexity evolves for increasing p for the three algorithms.
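The parameter counts in this section can be sketched as a quick comparison (the function name is my own; each count is the shared K*p means plus the method’s covariance parameters):

```python
def n_params(p, K):
    """Number of estimated parameters for each method.

    All methods estimate K*p means; they differ in covariance parameters:
      LDA: one shared covariance matrix       -> p(p+1)/2
      QDA: one covariance matrix per class    -> K * p(p+1)/2
      GNB: only variances, one set per class  -> K * p
    """
    means = K * p
    return {
        "LDA": means + p * (p + 1) // 2,
        "QDA": means + K * p * (p + 1) // 2,
        "GNB": means + K * p,
    }

for p in (2, 10, 100):
    print(p, n_params(p, K=2))
```

For K=2 and p=100, QDA already needs 10,300 parameters against GNB’s 400, which shows how quickly the full covariance matrices dominate.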

## So What?

From a modelling perspective, knowing which assumptions you’re making is important when applying a method. The more parameters one needs to estimate, the more sensitive the final classification is to changes in our sample. At the same time, if the number of parameters is too low, we’ll fail to capture important differences across classes.

Thank you for your time, I hope it was interesting.

*All images unless otherwise noted are by the author.*
