Suicide is a significant public health concern that affects individuals, families, and communities. In this project, we explore trends and patterns in suicide rates from 1985 to 2021, using a sample of roughly 44,000 data points gathered from 141 countries. Through this analysis, we aim to better understand the factors that contribute to suicide and the populations most at risk. Gender, for instance, plays a large role: men are more likely to suppress their feelings and are on average more violent, which can lead them to opt for more violent solutions to their problems. Age matters as well; the older a person is, particularly without a family, the more likely it is that the people they love are no longer in their life.
In this assignment we aim to build a complete ML project covering
- Checking Data Quality
- Feature Selection
- Modeling — Training Models
- Selecting the Best Model
- Hyperparameter Tuning
- Model Interpretability
- Reports and Visualizations
What is human-level performance on that task? What level of performance is needed?
- ML techniques can identify patterns and relationships in the data, which can help in understanding the factors that contribute to suicide and developing strategies to prevent it.
- Human experts in suicide prevention and mental health can also analyze such data, but the level of performance needed for the ML model depends on the goals of the analysis.
- The specific methodology and criteria used to assess performance make it difficult to define human-level performance on this task. Ultimately, the level of performance needed will depend on the specific context and goals of the analysis.
import pandas as pd

# Creating binary variables for the categorical column 'sex'
data4 = pd.get_dummies(data3)
# Normalizing the 'suicides_no' column because its values are much larger than the other independent variables
from sklearn import preprocessing
# Create x to store scaled values as floats
x = data4[['suicides_no']].values.astype(float)
# Preparing for normalizing
min_max_scaler = preprocessing.MinMaxScaler()
# Transform the data to fit minmax processor
x_scaled = min_max_scaler.fit_transform(x)
# Run the normalizer on the dataframe
data4[['suicides_no']] = pd.DataFrame(x_scaled)
- Creates binary variables for the categorical variable ‘sex’ using get_dummies() and normalizes the ‘suicides_no’ column using MinMaxScaler from scikit-learn’s preprocessing module.
- Extracts the original values of the ‘suicides_no’ column into the NumPy array x, scales x with the MinMaxScaler object min_max_scaler, and puts the scaled values back into the ‘suicides_no’ column of data4.
- The resulting DataFrame data4 has new binary columns for ‘sex’ and a ‘suicides_no’ column scaled between 0 and 1.
Preliminary findings from the EDA.
- Datatype Check — All the features in the dataset are integers or floats.
- Missing Data Check — The dataset did not have any missing values in any of the features.
- Distribution of Training Data — I checked the probability distribution of every feature in the training dataset.
- Correlation Check — I performed a correlation check using a heatmap and a pairplot; there were no significant multicollinearity issues.
- Barplot — Performed a barplot analysis to find out how each variable impacts the number of suicides (a minimal sketch of these checks follows this list).
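A minimal sketch of these checks, assuming the cleaned dataframe data3 used earlier (column names such as ‘sex’ and ‘suicides_no’ follow this dataset) and that matplotlib and seaborn are available:
import matplotlib.pyplot as plt
import seaborn as sns

print(data3.dtypes)                          # datatype check
print(data3.isnull().sum())                  # missing-value check

data3.hist(figsize=(12, 8))                  # per-feature distributions
plt.tight_layout()
plt.show()

# Correlation check on the numeric columns
sns.heatmap(data3.select_dtypes("number").corr(), annot=True, cmap="coolwarm")
plt.show()

# Barplot: how a categorical variable relates to the number of suicides
sns.barplot(x="sex", y="suicides_no", data=data3)
plt.show()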
The above graphs are a cohesive representation of how the number of suicides changes with respect to all the significant variables.
A few findings we can interpret from the above graphs:
- The number of suicides increases as suicides_population increases
- The number of suicides increases as the population increases
These are the preliminary findings about how the individual parameters are affecting the number of suicides.
This dataset was relatively clean with no missing values, but handling missing values is one of the most important tasks in any Data Science project.
Let’s fit a very simple linear model to understand how the features affect the number of suicides.
This code is creating a normalized dataframe by dropping the “suicides_no” column from the original dataframe “data4”. The values of the remaining columns in the “df” dataframe are normalized to fall within the range of 0 to 1. Then, the “suicides_no” column is added back to the normalized dataframe to create “df_norm”.
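A minimal sketch of that step, assuming the data4 dataframe from the preprocessing block above; the resulting name df_norm matches the model code that follows:
from sklearn.preprocessing import MinMaxScaler
import pandas as pd

# Normalize every feature except the target, then re-attach the target column
df = data4.drop(columns=["suicides_no"])
scaler = MinMaxScaler()
df_norm = pd.DataFrame(scaler.fit_transform(df), columns=df.columns, index=df.index)
df_norm["suicides_no"] = data4["suicides_no"]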
import statsmodels.formula.api as smf

results = smf.ols(
"suicides_no ~ population + suicides_Population + GdpPerCapita + sex_female + sex_male + generation_Boomers + generation_Millenials + generation_Silent",
data=df_norm,
).fit()
print(results.summary())
- OLS model has suicides_no as dependent variable and population, suicides_Population, GdpPerCapita, sex_female, sex_male, generation_Boomers, generation_Millenials, and generation_Silent as independent variables.
- Summary includes model performance info, significance and coefficients of each independent variable, and statistical measures like R-squared value and p-values.
Feature Selection is the process of selecting the features that are relevant to a machine learning model, i.e., keeping only those attributes that have a significant effect on the model’s output.
Null-Hypothesis
Statistically, this can be done using the p-value.
P-Value — The probability value tells how likely it is that a result occurred by chance alone. The p-value is used in hypothesis testing to help you support or reject the null hypothesis; the smaller the p-value, the stronger the evidence to reject the null hypothesis.
Features in our dataset with P value < 0.05
- population
- suicides_Population
- GdpPerCapita
- sex_female
results2 = smf.ols(
"suicides_no ~ population + suicides_Population + GdpPerCapita + sex_female + generation_Millenials + generation_Silent",
data=df_norm_feature_selected,
).fit()
print(results2.summary())
import numpy as np

def percentage_change(l1, l2):
    # Mean relative deviation between actual (l1) and predicted (l2) values;
    # the small constant avoids division by zero.
    percent_change = np.abs(l2 - l1) / (l1 + 0.000000001)
    avg_change = np.mean(percent_change)
    return avg_change
print(
"Accuracy of predicting the correct number of suicides using all features = ",
100 - percentage_change(df_norm["suicides_no"], df_norm["predicted_number_1"]),
)
print(
"Accuracy of predicting the correct number of suicides using only significant features is = ",
100
- percentage_change(
df_norm_feature_selected["suicides_no"],
df_norm_feature_selected["predicted_number_2"],
),
)
- Features selected: population, suicides_Population, GdpPerCapita, sex_female, generation_Millenials, generation_Silent. Target variable: suicides_no. Predicted values are calculated using all features in the original dataframe except suicides_no, rounded, and saved in the new column “predicted_number_1” in df_norm.
- Selected features: population, suicides_Population, GdpPerCapita, sex_female, generation_Millenials, generation_Silent, plus the target suicides_no. Predicted values are rounded and saved in the new column “predicted_number_2” in df_norm_feature_selected (a hedged sketch of this step follows the list).
- The function “percentage_change” calculates the relative change between two lists of numbers; the average change across all elements is returned as “avg_change”.
- Selecting only the statistically significant features has little to no effect on the outcome: both models reach essentially the same accuracy (~99.35%).
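For reference, a hedged sketch of how the two prediction columns compared above could be produced from the fitted OLS models; the column names follow the text, while the exact rounding call is an assumption:
# Predictions from the full model and the reduced (feature-selected) model
df_norm["predicted_number_1"] = results.predict(df_norm).round()
df_norm_feature_selected["predicted_number_2"] = results2.predict(df_norm_feature_selected).round()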
How did you split the data into train and test sets?
The data was split into training (90%) and test (10%) sets.
This code prepares the data for training and testing by first separating the target variable “suicides_no” from the input features, which are stored in the variable “X”. Then, it splits the data into training and test sets using the “train_test_split” function from the “sklearn” library. The test size is set to 10%, and the random state is set to 42 for reproducibility.
Next, the code samples 100 data points from both the training and test sets, which will be used for SHAP analysis later. The samples are stored in the variables “x_train_100” and “x_test_100”, respectively.
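A minimal sketch of this preparation, assuming the normalized dataframe df_norm from earlier; the sampling call is an assumption about how the 100-point SHAP samples were drawn:
from sklearn.model_selection import train_test_split

# Separate target and features
y = df_norm["suicides_no"]
X = df_norm.drop(columns=["suicides_no"])

# 90/10 train/test split with a fixed seed for reproducibility
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=42)

# Small samples kept aside for the SHAP analysis later
x_train_100 = x_train.sample(100, random_state=42)
x_test_100 = x_test.sample(100, random_state=42)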
Fitting a Linear Model
from sklearn.linear_model import LinearRegression

linear_model = LinearRegression()  # Initializing a linear model
linear_model.fit(x_train, y_train) # Training a linear model
y_linear_predictions = linear_model.predict(x_test).round()
It then trains the model using the training data (x_train and y_train) using the fit() method. Finally, it makes predictions on the test data (x_test) using the trained model and rounds the predictions using the round() function. The predicted values are stored in the y_linear_predictions variable.
Fitting a Tree Based Model
from sklearn.ensemble import RandomForestRegressor

tree_model = RandomForestRegressor(
max_depth=X.shape[1], random_state=0, n_estimators=10
)
tree_model.fit(x_train, y_train)
y_tree_based_predictions = tree_model.predict(x_test).round()
This code trains a random forest regression model on the training data using the RandomForestRegressor. The max_depth parameter sets the maximum depth of each decision tree in the forest, and n_estimators sets the number of trees. The model is then used to make predictions on the test set, and the predicted values are rounded to the nearest integer using the round() method. The predicted values are stored in the variable y_tree_based_predictions.
Fitting a Support Vector Machine (SVM)
from sklearn import svm

regr = svm.SVR()
svm_model = regr.fit(x_train, y_train)
svm_predictions = svm_model.predict(x_test).round()
This code is using Support Vector Regression (SVR) from scikit-learn to create a model for predicting the target variable suicides_no based on the features in x_train. First, an instance of the SVR class is created with default hyperparameters, then the fit method is called on the training data to train the model. Finally, the predict method is used to make predictions on the test data (x_test), and the output is rounded to the nearest integer. The predicted values are stored in the svm_predictions variable.
Fitting a MLP Regressor
from sklearn.neural_network import MLPRegressor

regr = MLPRegressor(random_state=1, max_iter=500).fit(x_train, y_train)
mlp_predictions = regr.predict(x_test).round()
This code is importing the MLPRegressor class from the scikit-learn library to create a multi-layer perceptron regressor model. It is then initializing an instance of the MLPRegressor class with specified hyperparameters such as the random state and maximum number of iterations, and then fitting the model on the training data using the fit() method. Finally, it is making predictions on the test data using the predict() method and rounding off the predictions to the nearest integer using the round() method.
- The first function, autoML(), takes in two H2O dataframes, df_train and df_test, and performs AutoML on the training data. The function then returns the original training and test dataframes and the trained AutoML object.
- The second function, getBestModel(), takes in the trained AutoML object aml and uses the leaderboard to select the best performing model. The function creates a dictionary of model names and their corresponding index in the leaderboard, and selects the best model as the highest performing non-ensemble model or the best performing GLM model.
Overall, these functions allow for efficient and automated model selection using H2O’s AutoML functionality.
- The getBestModel() function is called with the trained AutoML object as an argument to identify the best-performing model. The resulting model is assigned to the variable autoML_model.
- The predict() function of the autoML_model object is called on the test data set to generate predictions, and the .round() method rounds the predicted values to the nearest integer.
- The resulting predictions are assigned to the variable autoML_best_predictions for evaluating the model’s performance. The H2O frames containing the predicted and true values are converted into lists using the h2o.as_list() function (a hedged sketch of these steps follows this list).
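A hedged sketch of what these helpers might look like, assuming h2o has been initialised and df_train/df_test are H2OFrames with ‘suicides_no’ as the target; max_models=20, the seed, and the non-ensemble filter are assumptions rather than the notebook’s exact code (the original also falls back to the best GLM):
import h2o
from h2o.automl import H2OAutoML

def autoML(df_train, df_test, target="suicides_no", max_models=20):
    # Run AutoML on the training frame and return the frames plus the AutoML object
    features = [c for c in df_train.columns if c != target]
    aml = H2OAutoML(max_models=max_models, seed=1)
    aml.train(x=features, y=target, training_frame=df_train)
    return df_train, df_test, aml

def getBestModel(aml):
    # Pick the highest-ranked non-ensemble model from the leaderboard
    leaderboard = aml.leaderboard.as_data_frame()
    non_ensemble = [m for m in leaderboard["model_id"] if "StackedEnsemble" not in m]
    return h2o.get_model(non_ensemble[0])

_, _, aml = autoML(df_train, df_test)
autoML_model = getBestModel(aml)
autoML_best_predictions = autoML_model.predict(df_test).round()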
The code is splitting the dataset into three parts: training, validation, and test. Then, it sets up a hyperparameter grid for the XGBoost model using the H2OGridSearch function. The hyperparameters to be tested are ntrees, max_depth, and sample_rate. It then trains the grid using the training and validation sets and chooses the best model based on its R-squared value. Finally, the function find_best_model_from_grid is called to find the best model from the grid based on the R-squared value, and the result is stored in the best_xgboost_model variable.
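A hedged sketch of that grid search, assuming H2OFrames named train and valid; the grid values are placeholders, and rather than the notebook’s find_best_model_from_grid helper this sketch simply sorts the grid by validation R²:
from h2o.estimators import H2OXGBoostEstimator
from h2o.grid.grid_search import H2OGridSearch

# Hyperparameter grid (placeholder values)
hyper_params = {
    "ntrees": [50, 100, 200],
    "max_depth": [3, 5, 7],
    "sample_rate": [0.7, 0.8, 1.0],
}

features = [c for c in train.columns if c != "suicides_no"]

grid = H2OGridSearch(model=H2OXGBoostEstimator(seed=1), hyper_params=hyper_params)
grid.train(x=features, y="suicides_no", training_frame=train, validation_frame=valid)

# Best model by R-squared on the validation data
sorted_grid = grid.get_grid(sort_by="r2", decreasing=True)
best_xgboost_model = sorted_grid.models[0]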
The evaluation metrics which I am using are
- Mean Squared Error
- Root Mean Squared Error
- Mean Absolute Error
- Mean Residual Deviance
Evaluation metrics are used to evaluate the best model for regression problems. Mean residual deviance and accuracy are important metrics used to evaluate the model.
Residual deviance measures how well the response variable can be predicted by the model; lower residual deviance values indicate better predictability of the response variable.
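As a reference, a minimal sketch of computing these metrics for one of the scikit-learn models (for a Gaussian model, mean residual deviance reduces to the MSE; the H2O models report the same numbers via model_performance()):
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

mse = mean_squared_error(y_test, y_linear_predictions)
rmse = np.sqrt(mse)                      # root mean squared error
mae = mean_absolute_error(y_test, y_linear_predictions)
print(f"MSE={mse:.4f}  RMSE={rmse:.4f}  MAE={mae:.4f}")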
How do training, validation, and test metrics compare?
The best model (AutoML’s hyperparameter-tuned model) has done pretty well on both the training and validation datasets.
The mean residual deviance of this model is ~2.8628 on the training set and ~1.228 on the validation set, and the other metrics on both datasets confirm that the model is not overfitting the training data. Please refer below to see the evaluation metrics on both the training and validation datasets.
best_xgboost_model
Which models did you explore and did you try to tune the hyperparameters of the best model you got?
I trained multiple models, from the simplest linear regression to AutoML, to get the best fit. The models trained for this purpose were:
- Linear Regressor
- Random Forest Regressor
- MLP Regressor
- Support Vector Machine (SVM) Regressor
- AutoML
The models that gave me the best performance in predicting the number of suicides were the MLP Regressor and AutoML’s hyperparameter-tuned model.
Model Selection
From the above summary of the models trained on the dataset, the two best models are:
- MLP Regressor with an Accuracy of ~ 99.66%
- AutoML’s Hyperparameter Tuned Version ~ 99.66%
Let’s try to understand how both the models have been trained.
Interpreting MLP Regressor using SHAP values
Here the x-axis is the feature and the y-axis is the model output as we vary that feature. The grey histogram is the distribution of the feature in the dataset, and the cross made by E[feature] and E[f(x)] marks the expected values.
Let us take the population feature as an example.
The cross is made at approximately E[f(x)] = 0.17, and as population increases the expected output also increases.
The red line on the plot corresponds to a single sample (sample_ind = 18) given as input; by plotting it we can see how the model output for that sample differs from the expected value.
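A hedged sketch of how such a plot can be produced, assuming a recent shap release that provides shap.partial_dependence_plot and the 100-point samples created earlier; the explainer setup is an assumption, not the notebook’s exact code:
import shap

# Build an explainer around the fitted MLP (regr) with a small background sample
explainer = shap.Explainer(regr.predict, x_train_100)
shap_values = explainer(x_test_100)

sample_ind = 18
shap.partial_dependence_plot(
    "population", regr.predict, x_test_100,
    model_expected_value=True, feature_expected_value=True, ice=False,
    shap_values=shap_values[sample_ind : sample_ind + 1, :],  # red line for one sample
)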
Interpreting SHAP Feature Importance Plot for Linear and Tree-based model
The idea behind SHAP feature importance is simple: Features with large absolute Shapley values are important. Since we want global importance, we average the absolute Shapley values per feature across the data. Next, we sort the features by decreasing importance and plot them.
The following plot is the SHAP feature importance plot for the MLP model.
From the plot below, we can conclude that suicide_population is the most important feature, followed by population and sex_female.
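A minimal sketch, assuming the shap_values Explanation object computed in the earlier MLP sketch:
# Global importance: mean absolute SHAP value per feature
shap.plots.bar(shap_values)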
Interpreting Waterfall SHAP visualization
Let’s consider the same sample (sample_ind = 18).
It shows that the model output for this sample is f(x) = 0.003, while the expected (baseline) output is E[f(x)] = 0.13. The waterfall plot explains how we moved from the baseline to the model output and which features contributed how much. The graph below shows that suicide_population has the largest, negative contribution for this specific sample, lowering the predicted number of suicides by 0.08. GdpPerCapita also has a negative contribution, bringing the prediction down by another 0.1 for this sample, and so on. Using this plot we can visually interpret why exactly this specific sample produces the output it does.
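A minimal sketch of the waterfall plot for that sample, again assuming the shap_values from the earlier sketch:
# Waterfall for a single instance (sample_ind = 18)
shap.plots.waterfall(shap_values[sample_ind])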
Interpreting the Summary Plot for the MLP Model
Here the features are listed in descending order of their importance. This is one of the easiest ways to analyze an ML model and how the features are affecting the target and to what extent.
- Each dot (both red and blue) represents a feature value for one record. Red represents high values whereas blue represents low values.
- If a dot is on the right side of the y-axis, that feature had a positive impact on the prediction; if it is on the left side, it had a negative impact.
- The position of a dot on the x-axis represents the intensity of its impact; the further it is from the axis, the greater the intensity.
Let us try to understand how the features are affecting the model.
- Higher values of suicides_population tend to have a positive impact on the number of suicides
- Lower values of sex_female tend to have a positive impact on the number of suicides
- Higher values of GdpPerCapita tend to have a positive impact on the number of suicides
- Similar readings can be made for the remaining features
We can also read off the intensity of each feature’s impact from the same plot.
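A minimal sketch of the summary (beeswarm) plot, assuming the same shap_values object:
# Beeswarm summary plot: direction and intensity of each feature's impact
shap.plots.beeswarm(shap_values)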
SHAP partial dependence plot for a tree based model
Heatmap Visualization for Linear and Tree-based model
This visualization is a summarization of the entire dataset on how each data point in every feature is affecting the target
Here the y-axis shows the features and the x-axis shows the instances of each feature.
The color of an instance indicates whether it had a positive or negative effect: red means a positive effect and blue means a negative effect.
The intensity of the color is directly proportional to the intensity of the effect; the deeper the color, the more impactful the feature is.
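A minimal sketch of the heatmap, again assuming the shap_values object from the earlier sketch:
# Heatmap over all explained instances (features on the y-axis, instances on the x-axis)
shap.plots.heatmap(shap_values)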
How are errors/residuals distributed and how interpretable is your model?
Residual Analysis
- Here we can see striped lines in the residuals, which is an artifact of the response being an integer value rather than a real value. It can also be observed from the graph below that the residuals are approximately normally distributed and show no sign of heteroscedasticity.
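A hedged sketch of a quick residual check with matplotlib, assuming the best H2O model from above and a test H2OFrame named test; the frame name is an assumption:
import matplotlib.pyplot as plt

preds = best_xgboost_model.predict(test).as_data_frame()["predict"]
actual = test.as_data_frame()["suicides_no"]
residuals = actual - preds

plt.scatter(preds, residuals, s=5)      # fitted values vs. residuals
plt.axhline(0, color="red", linewidth=1)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

plt.hist(residuals, bins=50)            # distribution of the residuals
plt.xlabel("Residual")
plt.show()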
Variable Importance
- The variable importance plot replicates the results we got from the linear model and the tree-based model above. The variable importances have been scaled between 0 and 1 for ease of understanding.
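A minimal sketch, assuming the H2O model selected above:
# Variable importance plot and the scaled importances behind it
best_xgboost_model.varimp_plot()
print(best_xgboost_model.varimp(use_pandas=True)[["variable", "scaled_importance"]])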
SHAP Summary
From the SHAP summary diagram, we can draw a few conclusions:
1. All the features are listed by their importance to the prediction; suicide_population is the most significant, followed by population, and so on.
2. The position on the SHAP value axis indicates the impact a feature has on the prediction, either positive or negative: the further a data point is from a SHAP value of 0.0, the larger its impact. As we can see, suicide_population has the most impact on the number of suicides. The color of a dot represents the feature value (Red — High, Blue — Low); for example, a high value of suicide_population results in a high number of suicides. The red points for suicide_population are spread far more widely than the blue points, so we can interpret that the number of suicides increases sharply as suicide_population increases, while low values have little impact. Looking at the distribution of SHAP values for this feature, extreme values at both ends can have a significant effect on the number of suicides.
The advantage of SHAP analysis over normal feature importance is that we could visualize how the feature is affecting the target at different values. The standard methods tend to overestimate the importance of continuous or high-cardinality categorical variables.
Partial Dependence Plot (PDP)
A partial dependence plot shows the marginal effect of a feature on the target. It is achieved by keeping all other variables constant and changing the value of one variable to get its PDP.
For Interpretation purposes, let us pick up the two most important variables — suicide_population and population.
1. When the rest of the variables are kept constant and suicide_population is varied marginally, the mean response increases between suicide_population levels of 0.0 and 10. This range of suicide_population can be interpreted as a deciding factor in the number of suicides.
2. Similarly, when the rest of the variables are kept constant and population is varied marginally, the mean response goes up over the population range of 0.0 to 400000. Hence, this range can be interpreted as a deciding factor in the number of suicides.
The computation of partial dependence plots is intuitive: The partial dependence function at a particular feature value represents the average prediction if we force all data points to assume that feature value.
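A minimal sketch of computing a PDP for the two most important features, using the scikit-learn random forest trained earlier; the column names (e.g. 'suicides_Population') are assumed to match the OLS formula above:
from sklearn.inspection import PartialDependenceDisplay

# Average (partial dependence) curves for the two key features
PartialDependenceDisplay.from_estimator(
    tree_model, x_train, ["suicides_Population", "population"], kind="average"
)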
ICE (Individual Conditional Expectation) Plot
ICE plots and PDP plots are similar in that they both show the relationship between a feature and the model prediction. However, PDP plots show the average effect of the feature while ICE plots show the effect of the feature on specific instances.
To illustrate, consider the suicide_population and population features. In the PDP plot, we saw that the number of suicides increased when suicide_population was between 0 and 10 and population was between 0 and 400000. However, this may not hold true for all instances in the dataset. The ICE plot shows that for some instances the number of suicides may increase significantly in this range (the 0th percentile instance), while for others it may not change much (the 100th percentile instance).
Individual conditional expectation curves are even more intuitive than partial dependence plots as they show the prediction for a specific instance as we vary the feature of interest.
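A minimal sketch of the corresponding ICE curves (with the PDP average overlaid), again using the scikit-learn random forest from earlier; the subsample size is an arbitrary choice to keep the plot readable:
from sklearn.inspection import PartialDependenceDisplay

PartialDependenceDisplay.from_estimator(
    tree_model, x_train, ["suicides_Population", "population"],
    kind="both",        # per-instance ICE curves plus the averaged PDP line
    subsample=100,      # number of ICE lines to draw
    random_state=0,
)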
- The majority of the time should be invested in data preparation, i.e. cleaning the data, normalizing, feature selection, imputation, etc.
- Hyperparameter tuning is the second most important step after data preparation, which most practitioners ignore, but the results are worth the time invested.
- Multiple models should be trained and the best ones selected for deployment, as some algorithms perform much better than others on specific tasks.
- Model interpretation (unboxing the black box) is the best takeaway from this series of assignments. SHAP, LIME, and PDP have made it easier to understand what made a model predict an outcome.
- https://colab.research.google.com/drive/1HlZjigH7ZkXdb7n-_f8zAciwMM2B8yKK#scrollTo=5KTQ-xEuuAR8
- https://docs.h2o.ai/h2o/latest-stable/h2o-docs/grid-search.html
- https://towardsdatascience.com/explain-your-model-with-the-shap-values-bc36aac4de3d
- https://github.com/aiskunks/YouTube/blob/main/A_Crash_Course_in_Statistical_Learning/Full_ML_Report/Wine-Quality-Analysis.ipynb