The data was collected from the North East of Andhra Pradesh, India. The data set has 894 samples in total: the training set consists of 583 samples and the testing set of 311 samples.

There are eleven (11) attributes in the data set. 'Gender' and 'Class' are nominal attributes, while all the others are numerical. The last attribute, 'Class', is the label that divides the records into two groups: liver patient or not.

Some variables had to be converted to the correct data type. Most of them were stored as the 'object' data type and were therefore converted to 'float' or 'int' as appropriate. These changes were done in MS Excel.
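The same conversion can be reproduced in pandas; the column names below are illustrative, not the data set's exact schema.

```python
# Sketch of the dtype conversion (done in Excel in the article), shown in pandas.
import pandas as pd

df = pd.DataFrame({
    "Age": ["45", "32", "61"],      # stored as strings, i.e. 'object' dtype
    "TB":  ["0.7", "10.9", "1.1"],  # total bilirubin, also 'object'
})

# Convert the object columns to proper numeric dtypes
df["Age"] = df["Age"].astype(int)
df["TB"] = df["TB"].astype(float)
```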

There were some missing values in the original data set. Since all the affected columns had many missing values, they were imputed with the respective column medians (this was also done in the Excel files).
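Median imputation of this kind is a one-liner in pandas; the columns and values here are invented for illustration.

```python
# Hedged sketch of median imputation (performed in Excel in the article).
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ALB": [3.1, np.nan, 4.0, np.nan, 3.5],
    "TP":  [6.8, 7.2, np.nan, 6.5, 7.0],
})

# Replace every missing value with that column's median
df = df.fillna(df.median())
```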

Dummy variables were created for the two categorical variables.
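In pandas this is typically done with `get_dummies`; a minimal sketch for the binary 'Gender' attribute (toy values, not the real data):

```python
# Dummy-encode the nominal 'Gender' column; drop_first=True keeps a single
# dummy per binary variable and avoids perfect collinearity.
import pandas as pd

df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Class":  [1, 0, 1],
})

df = pd.get_dummies(df, columns=["Gender"], drop_first=True)
```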

No duplicate records were found in the data set.
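A duplicate check like this can be done directly in pandas (toy rows for illustration):

```python
# Count exact duplicate rows; a value of 0 confirms the data set is clean.
import pandas as pd

df = pd.DataFrame({"Age": [45, 32, 45], "TB": [0.7, 10.9, 0.8]})
n_duplicates = df.duplicated().sum()
```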

The train and test sets were kept as provided, because the training set was too small to split further into train, validation, and test sets.

A very strong correlation was found between the variables DB and TB. Strong correlations also exist between (SGOT, SGPT), (ALB, TP), and (ALB, AG_Ratio).
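Correlations like these come straight out of `DataFrame.corr()`. A sketch with synthetic numbers (only the column names DB and TB come from the article):

```python
# Build a synthetic DB column tightly coupled to TB, then check the
# pairwise correlation matrix, as one would for the real data set.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
tb = rng.gamma(2.0, 1.5, size=200)
df = pd.DataFrame({
    "TB": tb,
    "DB": 0.5 * tb + rng.normal(0, 0.1, size=200),  # strongly tied to TB
})

corr = df.corr()
```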

The response variable, 'Class', is imbalanced.

Boxplots were then drawn for all numerical variables, and they revealed many outliers in every one of them. Some of the plots are shown here.
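The points a boxplot flags as outliers are those beyond 1.5 times the interquartile range from the quartiles. A quick way to count them per column (the values below are invented; 'SGOT' is just one of the article's column names):

```python
# Count boxplot-style outliers using the 1.5 * IQR rule.
import pandas as pd

df = pd.DataFrame({"SGOT": [20, 25, 22, 30, 28, 24, 500, 26]})

q1, q3 = df["SGOT"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["SGOT"] < q1 - 1.5 * iqr) | (df["SGOT"] > q3 + 1.5 * iqr)]
```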

## Hyper-parameter Optimization / Tuning

This was carried out with the scikit-learn Python machine learning library, using a grid search over candidate hyper-parameter values. The search produced a single set of well-performing hyper-parameters that was used to configure each model; at the end of the run, the best score and the hyper-parameter configuration that achieved it are reported.

## Modeling

Three classification models were fitted and compared to find the best performer: the K-Nearest Neighbors (KNN) algorithm, Random Forest, and Extreme Gradient Boosting (XGBoost).

## K-Nearest Neighbors Algorithm

First, the best parameters for the KNN algorithm were obtained through grid search.
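A sketch of that grid search with scikit-learn's `GridSearchCV`; the parameter grid and the synthetic data are assumptions, not the article's exact configuration.

```python
# Grid search over KNN hyper-parameters with 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data; the article would load its train set here instead.
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

param_grid = {
    "n_neighbors": [3, 5, 7, 9],
    "weights": ["uniform", "distance"],
    "metric": ["euclidean", "manhattan"],
}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5,
                    scoring="accuracy")
grid.fit(X, y)

# grid.best_params_ and grid.best_score_ hold the reported results
```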

In the classification report, the precision values for response classes 0 and 1 (Liver Patient: No (0), Yes (1)) are 1.00 and 0.99 respectively: of all the participants the model predicted to be liver patients, 99% actually are. The recall values are 0.98 and 1.00: of all participants who are liver patients, the model identified 100% correctly. The F1 scores for the two classes are 0.99 and 1.00 respectively; since both values are close to 1, the model performs well. The support values show that the test data set contains 90 non-patients and 221 liver patients. The overall accuracy of the model is 99%.
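The precision and recall figures quoted throughout these reports follow directly from the confusion matrix. A toy sketch (the labels below are invented, not the article's predictions):

```python
# Precision and recall derived by hand from the confusion matrix, then
# cross-checked against scikit-learn's own metric functions.
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 1]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = tp / (tp + fp)   # of predicted patients, how many truly are
recall = tp / (tp + fn)      # of true patients, how many were caught
```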

## Extreme Gradient Boosting Model

First, the best parameters were obtained for the XGBoost model.

In the classification report, the precision values for response classes 0 and 1 (Liver Patient: No (0), Yes (1)) are 0.89 and 0.75 respectively; these are the proportions of correct positive predictions out of all positive predictions for each class. Of all the participants the model predicted to be liver patients, only 75% actually are. The recall values are 0.19 and 0.99: the model identifies 99% of the liver patients but only 19% of the non-patients. The F1 scores for the two classes are 0.31 and 0.85 respectively, which shows the model's predictions are poor for class 0 and suggests the class imbalance in the data is hurting it. The support values show 90 non-patients and 221 liver patients in the test data set. The overall accuracy of the model is 76%.
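The article uses the xgboost library; as a hedged stand-in that needs only scikit-learn, the same kind of gradient-boosted model can be fitted like this (synthetic data, illustrative hyper-parameters):

```python
# Gradient boosting sketch using scikit-learn's built-in implementation
# in place of XGBoost, which the article actually used.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=7)

gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                random_state=7)
gb.fit(X_tr, y_tr)
acc = accuracy_score(y_te, gb.predict(X_te))
```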

## Random Forest Model

The best parameters obtained for the Random Forest Classifier are given here.

In the classification report, the precision values for response classes 0 and 1 (Liver Patient: No (0), Yes (1)) are 1.00 and 0.92 respectively: of all the participants the model predicted to be liver patients, 92% actually are. The recall values are 0.80 and 1.00: the model identifies 100% of the liver patients but only 80% of the non-patients. The F1 scores for the two classes are 0.89 and 0.96 respectively; both are reasonably close to 1, so the model performs well. The support values show 90 non-patients and 221 liver patients in the test data set. The overall accuracy of the model is 94%.
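The Random Forest fit itself follows the standard scikit-learn pattern; a sketch on synthetic data (the hyper-parameters shown are illustrative, not the tuned values from the grid search):

```python
# Fit a Random Forest classifier and score it on a held-out test split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=1)

rf = RandomForestClassifier(n_estimators=200, random_state=1)
rf.fit(X_tr, y_tr)
acc = accuracy_score(y_te, rf.predict(X_te))
```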

## ROC Curves

ROC Curves were plotted for all 3 models.

The highest AUC value was observed for the KNN algorithm: an AUC of 1, which means the model separates Response 1 from Response 0 perfectly, always ranking a randomly chosen liver patient above a randomly chosen non-patient. The XGBoost and Random Forest classifiers also score well, with AUC values of 0.99 and 0.89 respectively.
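An AUC of 1 corresponds exactly to the case where every positive is scored above every negative. A plotting-free sketch of the computation behind the curves (toy scores, not the article's):

```python
# Compute AUC and the ROC curve points from predicted probabilities.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 1])
# Every positive scores above every negative here, so the AUC is exactly 1.
y_score = np.array([0.1, 0.3, 0.8, 0.9, 0.7, 0.4, 0.95, 0.6])

auc = roc_auc_score(y_true, y_score)
fpr, tpr, thresholds = roc_curve(y_true, y_score)
```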

## Feature Selection

Feature selection was implemented to see whether model performance could be increased.

According to the feature importance scores, all variables except Gender look important. However, since some of the variables are correlated, features were selected using the following code.

Features ‘Age’, ‘ALK’, and ‘SGOT’ were found to be the most important features in the data set.

Note that, because correlations exist among some of the variables, correlated features are removed by this method of feature selection: of each correlated pair, only one variable needs to stay in the model.
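The article's exact selection code is not reproduced here, so the sketch below uses `SelectFromModel` over Random Forest importances as an assumed equivalent (synthetic data, illustrative threshold):

```python
# Importance-based feature selection: keep only features whose Random Forest
# importance is at least the median importance.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           random_state=0)

selector = SelectFromModel(RandomForestClassifier(random_state=0),
                           threshold="median")
selector.fit(X, y)
mask = selector.get_support()   # True for the features that are kept
n_kept = mask.sum()
```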

## KNN Algorithm Implementation with the Selected Variables

The best parameters obtained for the KNN Algorithm are given here.

In the classification report, the precision values for response classes 0 and 1 (Liver Patient: No (0), Yes (1)) are 1.00 and 0.99 respectively: of all the participants the model predicted to be liver patients, 99% actually are. The recall values are 0.97 and 1.00: of all participants who are liver patients, the model identified 100% correctly. The F1 scores for the two classes are 0.98 and 0.99 respectively; since both values are close to 1, the model performs well. The support values show 90 non-patients and 221 liver patients in the test data set. The overall accuracy of the model is 99%.

## KNN Algorithm Implementation with Sampling Techniques: SMOTE

Oversampling was applied to balance the response variable using SMOTE (Synthetic Minority Over-sampling Technique).

After oversampling, the response variable is balanced.
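In practice this step would use imbalanced-learn's `SMOTE` class; as a self-contained illustration of the core idea, the minimal NumPy sketch below creates synthetic minority samples by interpolating between a minority point and its nearest minority neighbour.

```python
# Minimal sketch of the SMOTE idea (the real work would use
# imbalanced-learn): each synthetic sample lies on the line segment
# between a minority point and its nearest minority neighbour.
import numpy as np

rng = np.random.default_rng(0)
minority = rng.normal(0, 1, size=(20, 3))  # 20 minority samples, 3 features
n_new = 30                                  # synthetic samples to create

synthetic = []
for _ in range(n_new):
    i = rng.integers(len(minority))
    # nearest minority neighbour of point i (excluding itself)
    d = np.linalg.norm(minority - minority[i], axis=1)
    d[i] = np.inf
    j = d.argmin()
    lam = rng.random()                      # interpolation factor in [0, 1)
    synthetic.append(minority[i] + lam * (minority[j] - minority[i]))

synthetic = np.array(synthetic)
```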

The best parameters obtained for the KNN algorithm are given here.

In the classification report, the precision values for response classes 0 and 1 (Liver Patient: No (0), Yes (1)) are 1.00 and 0.99 respectively: of all the participants the model predicted to be liver patients, 99% actually are. The recall values are 0.98 and 1.00: of all participants who are liver patients, the model identified 100% correctly. The F1 scores for the two classes are 0.99 and 1.00 respectively; since both values are close to 1, the model performs well. The support values show 90 non-patients and 221 liver patients in the test data set. The overall accuracy of the model is 99%.

## Feature selection on the data set obtained using SMOTE

Then, feature selection was implemented on the new data.

The best parameters obtained for the KNN algorithm are given here.

In the classification report, the precision values for response classes 0 and 1 (Liver Patient: No (0), Yes (1)) are 1.00 and 0.99 respectively: of all the participants the model predicted to be liver patients, 99% actually are. The recall values are 0.98 and 1.00: of all participants who are liver patients, the model identified 100% correctly. The F1 scores for the two classes are 0.99 and 1.00 respectively; since both values are close to 1, the model performs well. The support values show 90 non-patients and 221 liver patients in the test data set. The overall accuracy of the model is 99%.

From these results, the model fitted with the KNN algorithm on the selected variables can be considered the best: although several models achieve the same accuracy, the one that uses fewer features is preferable.

Let’s meet with another story soon!!
