## An overview of how SHAP feature contributions are calculated

Suppose you (Player 1) and a friend (Player 2) enter a Kaggle contest. You end up winning the first prize of **$10,000**. Now, you want to split the money fairly. Your friend suggests that you just split it equally. However, your hyperparameter tuning skills are superior. You believe you deserve a larger share as you contributed more to the team. With this in mind, how can you divide the money up fairly?

Conveniently, your friend owns a time machine. You each go back in time and redo the Kaggle contest alone. You end up coming second and make **$7,500**. Your friend only comes third and makes **$5,000**. If neither of you played, you wouldn't make any prize money (**$0**). We record the value of these 4 coalitions of players below. Clearly, you deserve more of the prize money, but it's still not clear exactly how to split it up.

One way to do it is by calculating the **expected marginal contribution** of each player. This is the weighted average of a player's contributions to all the coalitions that the player could join.

For example, Player 1 (P1) could join a coalition of only Player 2 (P2). P2 goes from third place to first place and the prize money increases by **$5,000**. P1 could also join a coalition of no players and increase the prize money by **$7,500**. These are the marginal contributions of P1. The average of these gives us an expected marginal contribution of **$6,250**.

We can follow a similar process to calculate the expected marginal contribution for P2. This time the value is **$3,750**. In the end, P1 will receive $6,250 and P2 will receive $3,750. These two values are also known as **Shapley values**. Notice the values add up to the total prize money of $10,000.
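
This two-player calculation can be reproduced in a few lines of Python, using the coalition values from the story:

```python
# Coalition values from the story (in dollars).
v = {
    (): 0,          # nobody plays
    (1,): 7_500,    # you (P1) alone come second
    (2,): 5_000,    # your friend (P2) alone comes third
    (1, 2): 10_000, # together you win first prize
}

# P1's marginal contributions: joining a coalition of P2, and joining
# a coalition of no players.
p1_contributions = [v[(1, 2)] - v[(2,)], v[(1,)] - v[()]]
p1_shapley = sum(p1_contributions) / len(p1_contributions)

# The same process for P2.
p2_contributions = [v[(1, 2)] - v[(1,)], v[(2,)] - v[()]]
p2_shapley = sum(p2_contributions) / len(p2_contributions)

print(p1_shapley, p2_shapley)  # 6250.0 3750.0
```

Note that the two Shapley values add up to the full $10,000 prize.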

**Shapley values** are considered a fair way to divide the prize money. They are a concept that comes from game theory. We've only calculated them for a team of 2 players, but we can use them to calculate the fair distribution for a team of any size. We will spend time understanding this **generalised Shapley value formula**. The formula may seem scary. However, when we get under the hood you will see that it has an intuitive explanation.

Shapley value = expected marginal contribution

This is the weighted average of a player's contributions to all the coalitions that the player could join.

It may seem like a big jump from dividing prize money to explaining machine learning models. Yet, Shapley values can be used to understand how each model feature (the players) has contributed to a prediction (the prize money). We will explain how Shapley values are **extended to explain model predictions**. We will end by exploring the **contributions of SHAP** to this area of research. That is, it has drastically increased the speed at which we can approximate Shapley values.

Before we move on, let’s walk through another example. This will make the general equation easier to understand. This time we have a team of **3 players**. The calculations are complicated and we’ll have to go back in time a few more times.

There will now be 8 possible coalitions, seen below. Together you win first prize (**$10,000**). There are now 3 coalitions of 2 players. For example, a coalition of P1 and P3 would come second (**$7,500**). There will also be coalitions of 1 player. For example, if P3 played alone they would not make any prize money (**$0**). Perhaps they should have invested in a better GPU.

We can use these coalition values to calculate the Shapley value for P1. There are now 4 coalitions that P1 could join. P1 can join a coalition of both P2 and P3, a coalition of only P2 or P3 or a coalition of no players. Like before, we calculate the marginal contributions of P1 to each of these coalitions. Finally, we take the weighted average. This gives us a Shapley value of **$5,000**.

The question you may be asking is: where do we get these weights from? That is, why do we weight the first marginal contribution by 1/3, the second by 1/6 and so on? These are the probabilities that P1 makes those specific contributions. Weighting by probabilities gives us an **expected** marginal contribution.

It’s not obvious where these probabilities come from. To start, we need to work out the number of ways a coalition of 3 people can be formed. This is because the full prize money ($10,000) can only be won if all 3 members work together.

To do this, we assume each member joins the team sequentially with equal chance. For example, P1 joins then P3 then P2. This gives us a total of 3! = 6 ways of forming the coalition. You can see all of these below. In general, there are n! ways of forming a team of n players.

Above we saw that if P1 joins a coalition of P2 and P3 they will make a marginal contribution of **$5,000**. This can happen in two ways: either P2 joins, then P3, then P1, or P3 joins, then P2, then P1. In other words, P1 will make this marginal contribution for 2 of the 6 ways the team can form. This gives us a probability of 2/6 = 1/3 that P1 makes this contribution.

There is only one way that P1 can make the second contribution (**$2,500**). That is if P2 joins, then P1, then P3. This gives us a probability of 1/6. Similarly, the third contribution (**$7,500**) has a probability of 1/6. The fourth contribution (**$5,000**) has a probability of 1/3. Like with the first contribution, there are two ways for P1 to make it: P1 joins first, followed by either P2 then P3, or P3 then P2.

We can follow the same process for P2 and P3. For these players, the Shapley values turn out to be **$3,750** and **$1,250** respectively. Again, all the Shapley values add up to the total prize money. Shapley values will always divide **all** the prize money fairly. Now let’s see how we can generalise the Shapley value.
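
We can check these numbers by averaging over join orders directly. The article only quotes some of the 3-player coalition values; the remaining values below are filled in so that they are consistent with the marginal contributions and Shapley values stated in the text:

```python
from itertools import permutations

# Coalition values for the 3-player game (in dollars). Values not quoted
# in the text are assumptions chosen to match the stated results.
v = {
    frozenset(): 0,
    frozenset({1}): 5_000,
    frozenset({2}): 5_000,
    frozenset({3}): 0,
    frozenset({1, 2}): 7_500,
    frozenset({1, 3}): 7_500,
    frozenset({2, 3}): 5_000,
    frozenset({1, 2, 3}): 10_000,
}

def shapley(player, players=(1, 2, 3)):
    """Average the player's marginal contribution over all n! join orders.
    This is equivalent to the probability-weighted average in the text."""
    orders = list(permutations(players))
    total = 0
    for order in orders:
        before = frozenset(order[: order.index(player)])
        total += v[before | {player}] - v[before]
    return total / len(orders)

print([shapley(p) for p in (1, 2, 3)])  # [5000.0, 3750.0, 1250.0]
```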

**Equation 1** gives the formula for the Shapley value of **player i** in a **p** player game. Starting with the summation sign, we are summing over all coalitions S, where S ranges over the coalitions that do not include player i. In other words, S covers all the coalitions to which player i is able to make a marginal contribution. Going back to the 3 player example, there were 4 coalitions that did not include P1.
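
The formula described in this and the following paragraphs can be written as:

```latex
\phi_i(val) = \sum_{S \subseteq \{1,\dots,p\} \setminus \{i\}}
  \underbrace{\frac{|S|! \, (p - |S| - 1)!}{p!}}_{\text{weight}}
  \underbrace{\left[\, val(S \cup \{i\}) - val(S) \,\right]}_{\text{marginal contribution}}
```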

Within the square bracket, we have the marginal contribution of player i to coalition S. Specifically, we have the value (val) of coalition S including Player i less the value of coalition S. The value function depends on the particular game that is being played. In our 3 player example, we used different notation. We spoke about coalition values and used the letter C with player subscripts. These coalition values give the value function for the Kaggle game.

Lastly, we weight the marginal contributions. Below, you can see what each of the components of the weight represents. Here **|S|** is the number of players in coalition S. This means **p-|S|-1** is the number of players that need to join the coalition after player i.

In the weight's numerator, we have the number of orderings in which the players of S can join before player i, multiplied by the number of orderings in which the remaining players can join after. The denominator is the number of ways the full team can form. So, the weight gives the probability that player i makes a contribution to a coalition of size |S| when there are p players in the game. If you sub in the values for our 3-player game, you get the same weights as before.
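
These weights can be checked with a short snippet, using exact fractions:

```python
from fractions import Fraction
from math import comb, factorial

def shapley_weight(coalition_size, p):
    """Probability that player i makes their contribution to a coalition
    of this size: |S|! (p - |S| - 1)! / p!"""
    s = coalition_size
    return Fraction(factorial(s) * factorial(p - s - 1), factorial(p))

# For the 3-player game: 1/3 for |S| = 0, 1/6 for |S| = 1, 1/3 for |S| = 2,
# matching the weights used in the worked example above.
weights = [shapley_weight(s, p=3) for s in range(3)]
```

Summing the weight over all coalitions not containing player i (there are C(p-1, |S|) of each size) gives exactly 1, confirming these are probabilities.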

Breaking down the Shapley value, you can see that it has an intuitive explanation. We are weighting all the marginal contributions of player i by the probabilities that they make the contributions. We then sum these weighted contributions over all the coalitions that player i can join. This gives us an **expected marginal contribution**. Using these values, we can divide the total value of the game among all the players.

Intuitively, it may seem that the **expected marginal contribution** is a fair way to do this. We consider contributions to all coalitions. This means we take into account a player's individual contribution and the interactions between players. That is, some players may work well together, increasing their joint value. The problem is there could be other ways to divide value that also seem fair. We need to prove the Shapley value is fair.

## Shapley axioms

The Shapley value is actually derived from 3 axioms. We will only summarise them but they can also be defined mathematically. These axioms can be considered a definition of fairness. Hence, a method of dividing value that satisfies this definition can be considered fair.

**Symmetry** Two players are considered interchangeable if they make the same contributions to all coalitions. If two players are interchangeable then they must be given an equal share of the game’s total value.

**Null player property** If a player makes zero marginal contribution to all coalitions then they get none of the total value.

**Additivity** If we combine two games, then a player’s overall contribution is the sum of the contributions for the two individual games. This axiom makes the assumption that any games played are independent.

We can prove mathematically that the Shapley value is the only efficient value that satisfies these 3 axioms. By efficient, we mean none of the game’s value is left over. Ultimately, under this definition, the Shapley value is the only fair way to divide value. It is amazing that such an intuitive formula can be derived from 3 simple axioms.

We can use Shapley values to understand how a model has made a prediction. Now, the value of the game is the model **prediction**. The **feature values** are the players. To be clear, it is not the features but the **values** of the features for a particular observation that play the game. However, we will refer to these as the features. We use Shapley values to calculate how each of the features has contributed to the prediction.

For example, in a previous article, we trained a model to predict the age of abalone. You can see the Shapley values for a particular observation in **Figure 1**. These give the difference between the average predicted age E[f(x)] and the observation's prediction f(x). For example, the value for **shucked weight** has increased the predicted age by 1.81. We refer to this as the feature's contribution.

To calculate these we can use the same Shapley value formula as before. All we need to do is change the value function. **Equation 2** gives the function we use for a particular observation **x**. Here S is a coalition of feature values, **f** gives the model predictions and the model has **p** features. The value of **S** is the model’s prediction marginalised over all features that are not in S. We use the actual values for those that are in S.
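
In symbols, one common way to write this value function is shown below, where x_S holds the observation's actual values for the features in S and the remaining features are treated as random variables (the notation may differ slightly from the article's Equation 2):

```latex
val_x(S) = \mathbb{E}_{X_{\bar{S}}}\left[ f(x_S, X_{\bar{S}}) \right]
         = \int f(x_S, X_{\bar{S}}) \, d\mathbb{P}(X_{\bar{S}})
```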

We are actually doing multiple integration in the above equation. To marginalise over a feature, we integrate the prediction function w.r.t. the probability of the feature’s values. We do this for all features in S. To do this we need to know the feature distributions or use the empirical distributions.

There are a lot of moving parts in the above formula. In the next part of this section, we will go over a ML example to better understand it. We will then move on to discussing how we can **approximate Shapley values** for ML. To end this section, we will discuss the **properties of Shapley values**. These follow from the axioms and we will discuss what they mean in the context of machine learning.

## ML Example

Suppose we want to predict someone’s income. We have two features — age (feature 1) and degree (feature 2). We end up with the model **f** below. For age, we assume the feature is uniformly distributed between 18 and 60. Similarly, for degree, we assume there is an equal chance that someone will have a degree (1) or not (0). For our observation, we have a person aged 20 with a degree. The model will predict that this person has an income of $5,000.

We can calculate the marginal contribution of feature 2, {2}, to a coalition of feature 1, S = {1}. This could be used to help calculate the Shapley value for feature 2. In other words, to calculate the contribution of feature 2 (degree=1) to the prediction.

We start by calculating the value of a coalition of both features, S = {1,2}. This is relatively straightforward. S contains both features so we do not have to marginalise over any features. We can use the feature’s actual values. As seen below, this is the same as the prediction for this observation.

We then need to calculate the value of coalition, S = {1}. In this case, S does not contain {2}. We marginalise over feature 2 and use the actual value for feature 1. Remember, feature 2 is not continuous. To marginalise over the feature’s values we do not need to use integration. We just sum the prediction at each value times the probability of that value (i.e. 50%).

We can now calculate the marginal contribution of feature 2 to S = {1}. The weight of this contribution is calculated in the same way as for standard Shapley values. To be clear, we use the number of features in the coalition and the total number of features. We have |S| = |{1}| = 1 and p = 2. This gives us a weight of (1!)(0!)/2! = 1/2.
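
The article's model f was given as an image and is not reproduced here, so the snippet below uses a hypothetical stand-in, f(age, degree) = 100*age + 3000*degree, which also predicts $5,000 for a 20-year-old with a degree. The coalition values and the resulting marginal contribution are therefore illustrative, not the article's exact numbers:

```python
# Hypothetical stand-in for the article's model (an assumption).
def f(age, degree):
    return 100 * age + 3000 * degree

x_age, x_degree = 20, 1  # the observation being explained

# val({1, 2}): both features in the coalition -> just the prediction.
val_12 = f(x_age, x_degree)

# val({1}): marginalise over degree, which is 0 or 1 with 50% chance each.
# The feature is discrete, so the integral becomes a weighted sum.
val_1 = 0.5 * f(x_age, 0) + 0.5 * f(x_age, 1)

marginal_contribution = val_12 - val_1
weight = 1 / 2  # |S|! (p - |S| - 1)! / p! = 1! * 0! / 2!

print(val_12, val_1, marginal_contribution)  # 5000 3500.0 1500.0
```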

This only gives us part of the calculation needed for the Shapley value of feature 2. We will also need to calculate the marginal contribution of feature 2 to S={}. To do this we need to calculate the value function of S = {}. This would require us to marginalise over the distributions of both features.

## Approximation of Shapley values

Calculating exact Shapley values is computationally expensive. In the above example, we only had two features. As we add more features, the number of possible coalitions increases exponentially. In practice, it is only feasible to approximate the Shapley values.

One way to do this is using Monte-Carlo sampling. For feature i, we first calculate the prediction with the feature's actual value (+i). We do the same without the value (-i). That is, we replace it with a random value for feature i. The remaining feature values will also all be randomly sampled. We take the difference between these two predictions. Seen below, we do this M times and find the average of all of these differences. By randomly sampling and averaging, we implicitly weight by the distribution of the features.
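
A minimal sketch of this sampling procedure is below. It assumes a model f that takes a feature vector and a background dataset to draw replacement values from. A linear model makes a handy sanity check, because its exact Shapley value for feature i is w_i * (x_i - mean of background feature i):

```python
import random

def mc_shapley(f, x, background, i, n_samples=2_000, seed=0):
    """Monte-Carlo estimate of feature i's Shapley value for observation x.

    Each sample draws a random background instance z and a random feature
    order, then compares the prediction with and without x's value for
    feature i (all other features fixed the same way in both predictions).
    """
    rng = random.Random(seed)
    p = len(x)
    total = 0.0
    for _ in range(n_samples):
        z = rng.choice(background)
        order = list(range(p))
        rng.shuffle(order)
        pos = order.index(i)
        # Features up to and including i (in the random order) take x's
        # values, the rest take z's values.
        x_plus = list(z)
        for j in order[: pos + 1]:
            x_plus[j] = x[j]
        # Identical except feature i comes from z.
        x_minus = list(x_plus)
        x_minus[i] = z[i]
        total += f(x_plus) - f(x_minus)
    return total / n_samples

f = lambda v: 2 * v[0] + 3 * v[1]
background = [(0, 0), (2, 0), (0, 2), (2, 2)]  # both features average to 1
print(mc_shapley(f, (3, 5), background, i=0))  # close to 2 * (3 - 1) = 4
```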

The above process can still be impractical. We may need many samples to get a reasonable approximation of a Shapley value. This is where SHAP comes in. As we discuss in the last section, it is a faster way of approximating Shapley values. Before we move on to that, we will discuss the properties of Shapley values in the context of machine learning.

## Properties of Shapley values

Shapley values are one method of explaining predictions. They are popular because of their desirable properties. Most of these follow from the axioms of Shapley values.

**Efficiency** As mentioned, Shapley values are efficient. Before, this meant the full value of a game is divided amongst its players. For ML, this means the prediction is divided amongst the features. Specifically, the Shapley values satisfy the equation below. The sum of all Shapley values and the average predicted value is equal to the observation's prediction. We saw this earlier in **Figure 1**.
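
In symbols, the efficiency property for a single observation reads:

```latex
f(x) = \mathbb{E}[f(X)] + \sum_{i=1}^{p} \phi_i
```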

Another popular local interpretation method is LIME. In comparison, LIME is not necessarily efficient. The weights calculated will not add up to the original prediction. For Shapley, we know how much each feature has contributed to a prediction. For LIME, we only know which feature was most important for that prediction.

**Symmetry** Two features will have the same Shapley value if they make the same contributions to all coalitions.

**Dummy** A feature will have a Shapley value of 0 if it never changes the prediction. In other words, features that are not used by a model will have a Shapley value of 0.

**Additivity** Shapley values for machine learning are also additive. This is only relevant for ensemble models. In this case, the overall Shapley value can be calculated by taking the weighted average of the Shapley values for each model in the ensemble, where the weight is the same as the weight given to the predictions of each model. For example, in a random forest, the prediction from each decision tree is given an equal weight.

**Consistency** This property follows from the previous 3 properties. Suppose we change a model from M1 to M2. If a feature now increases a prediction more than before then its Shapley value will increase. This means we can reliably compare the feature contributions of different models.

The SHAP Python package has become synonymous with working with Shapley values. The key to the wide adoption of this package is the **speed** at which it can make approximations. We discuss a few of the methods below. The increased speed means we are also able to calculate many Shapley values. This has allowed for different aggregations of the values that give us a global view of the model.

## KernelSHAP

Kernel SHAP reframes the Shapley values as the parameters of a linear model. To put it simply, the approximation method works by first permuting feature values. After enough permutations, the Shapley values are estimated jointly using linear regression. Estimating the values together requires fewer computations than other sampling approaches, such as Monte-Carlo sampling, where the Shapley value for each feature is calculated individually.
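
The regression is not an ordinary least-squares fit: each sampled coalition is weighted by the Shapley kernel from the SHAP paper (Lundberg & Lee, 2017). A minimal sketch of that weight function:

```python
from math import comb

def shapley_kernel_weight(p, k):
    """Kernel SHAP regression weight for a coalition of k features out
    of p (Lundberg & Lee, 2017). It is undefined (infinite) for the
    empty and full coalitions, which are handled via constraints."""
    return (p - 1) / (comb(p, k) * k * (p - k))

# With p = 3 features, coalitions of size 1 and 2 get equal weight 1/3.
w1 = shapley_kernel_weight(3, 1)
w2 = shapley_kernel_weight(3, 2)
```

Mid-sized coalitions receive the largest weights per coalition, mirroring how the exact Shapley formula weights each coalition size.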

KernelSHAP is also a model-agnostic method of approximating Shapley values. This means it can be used with any model. That is, provided SHAP has been implemented for your modelling package.

## TreeSHAP

An approximation method that is not model agnostic is TreeSHAP. It takes advantage of the structure of the individual trees in the ensemble models. As a result, it can only be used with tree-based algorithms like random forests and XGBoost. The advantage of TreeSHAP is that it is significantly faster than KernelSHAP.

With KernelSHAP, the time needed to estimate Shapley values grows exponentially with the number of features, whereas TreeSHAP can estimate them in linear time. We discuss this difference in detail in the article below. We also explore how other aspects of your model can affect the approximation time. This includes the number of trees in the ensemble, the maximum depth and the number of leaves.

As mentioned, these methods make it possible to approximate a large number of Shapley values. We can combine them in different ways to gain an understanding of how the model works as a whole. One example is the beeswarm plot given in **Figure 2**.

Here we group the values for each feature (e.g. shell weight). We can see the features that tend to have large positive and negative Shapley values. These are the features that tend to make significant contributions to the predictions. We also colour the points by the feature's value. Doing so, we can start to understand the relationship between the features and the target variable.

The easy implementation of these types of plots is another reason the SHAP package has been widely adopted. We explore how to use this package in the article below. We discuss the Python code and we explore some of the other aggregations provided by the package.
