I’m learning to use a machine learning model for a dataset of survey responses on vaccination rates. All features in the dataset are either nominal or ordinal categories. I applied the DecisionTreeClassifier from the scikit-learn library to identify important features.
The DecisionTreeClassifier didn’t accept the object data type. I had to use encoders to convert the object data type to numeric. I had two options: OneHotEncoder and OrdinalEncoder. After some research, here’s what I found. I’m still in the learning process, so please correct me if I’m wrong.
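Here is a minimal sketch of the problem. The toy array is made up for illustration and is not my survey data. As far as I can tell, fitting on raw strings fails because scikit-learn tries to convert the input to floats:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: one categorical feature stored as strings.
X_raw = np.array([["low"], ["high"], ["medium"], ["low"]], dtype=object)
y = np.array([0, 1, 1, 0])

fit_failed = False
try:
    DecisionTreeClassifier(random_state=0).fit(X_raw, y)
except ValueError as err:
    # The strings cannot be converted to floats, so fitting is rejected.
    fit_failed = True
    print("fit failed:", err)
```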
For ordinal categorical features, it’s recommended to use OrdinalEncoder as it can preserve ordinal information. Some models can utilize this ordinal information. For nominal categorical features, OneHotEncoder is suggested.
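A small sketch of both encoders follows. The feature names `concern` and `region` and their values are hypothetical. One thing I noticed: OrdinalEncoder only preserves the intended order if you pass it through `categories`; otherwise the categories are sorted alphabetically.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical ordinal feature with a natural order: low < medium < high.
concern = np.array([["low"], ["high"], ["medium"], ["low"]])
ordinal = OrdinalEncoder(categories=[["low", "medium", "high"]])
concern_encoded = ordinal.fit_transform(concern)
print(concern_encoded.ravel())  # [0. 2. 1. 0.]

# Hypothetical nominal feature with no natural order.
region = np.array([["north"], ["south"], ["east"], ["north"]])
onehot = OneHotEncoder()
# fit_transform returns a sparse matrix by default; densify it for printing.
region_encoded = onehot.fit_transform(region).toarray()
print(onehot.categories_[0])  # ['east' 'north' 'south']
print(region_encoded[0])      # [0. 1. 0.]  (first row is "north")
```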
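Note that the two encoders also differ in the shape of their output, not just in whether they preserve order. OrdinalEncoder produces one integer column per feature, while OneHotEncoder adds one binary column per category, so the choice of encoder can affect model performance and is worth evaluating in its own right.
However, DecisionTreeClassifier has no notion of which features are ordinal. It treats every encoded column as a number and picks threshold splits (feature <= value) that most reduce Gini impurity or entropy (information gain). The integer order produced by OrdinalEncoder therefore still determines which categories a single split can group together. I experimented with different encoder versions: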
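To make the difference in output concrete, here is a sketch on a made-up two-feature array (the values are hypothetical). The first column has 3 distinct categories and the second has 4:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical data: column 0 has 3 categories, column 1 has 4.
X = np.array([
    ["low", "north"],
    ["high", "south"],
    ["medium", "east"],
    ["low", "west"],
])

ordinal_out = OrdinalEncoder().fit_transform(X)
onehot_out = OneHotEncoder().fit_transform(X)

print(ordinal_out.shape)  # (4, 2): one integer column per feature
print(onehot_out.shape)   # (4, 7): one binary column per category (3 + 4)
```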
- OrdinalEncoder for all features, whether ordinal or nominal.
- OneHotEncoder for all features, whether ordinal or nominal.
- Mixed encoders: OrdinalEncoder for ordinal features and OneHotEncoder for nominal features.
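The three versions above can be sketched roughly as follows. This is not my actual notebook: the DataFrame, the column names (`concern`, `region`, `vaccinated`) and the category order are hypothetical stand-ins. `ColumnTransformer` is what lets the mixed version apply a different encoder to each group of columns.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy survey data.
df = pd.DataFrame({
    "concern": ["low", "high", "medium", "low", "high", "medium", "low", "high"],
    "region": ["north", "south", "east", "north", "east", "south", "east", "north"],
    "vaccinated": [0, 1, 1, 0, 1, 0, 0, 1],
})
X, y = df[["concern", "region"]], df["vaccinated"]
ordinal_cols, nominal_cols = ["concern"], ["region"]

preprocessors = {
    # Without `categories`, the integer codes follow alphabetical order.
    "ordinal_all": OrdinalEncoder(),
    "onehot_all": OneHotEncoder(handle_unknown="ignore"),
    "mixed": ColumnTransformer([
        ("ord", OrdinalEncoder(categories=[["low", "medium", "high"]]), ordinal_cols),
        ("ohe", OneHotEncoder(handle_unknown="ignore"), nominal_cols),
    ]),
}

models = {}
for name, prep in preprocessors.items():
    model = make_pipeline(prep, DecisionTreeClassifier(criterion="gini", random_state=0))
    models[name] = model.fit(X, y)
    # The last pipeline step is the tree; report how many columns it was fed.
    print(name, model[-1].n_features_in_)
```

On this toy data the tree sees 2, 6 and 4 input columns respectively, which is the column-count difference between the encoders showing up in practice.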
I compared the resulting decision trees by Gini impurity, where lower is better. The best result came from the OrdinalEncoder version, with a Gini impurity of 0.247 for the dataset.
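For reference, the Gini impurity of a set of labels is 1 minus the sum of squared class proportions. Below is a minimal sketch of computing it by hand and of reading per-node impurities from a fitted tree through `tree_.impurity`. The toy data and the helper names `gini` and `weighted_leaf_gini` are assumptions for illustration, and averaging leaf impurities weighted by sample count is just one possible way to summarize a tree with a single number.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def gini(labels):
    """Gini impurity: 1 - sum over classes of p_k squared."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def weighted_leaf_gini(clf):
    """Sample-weighted mean impurity over the leaves of a fitted tree."""
    t = clf.tree_
    is_leaf = t.children_left == -1  # leaves have no children
    weights = t.weighted_n_node_samples[is_leaf]
    return float(np.sum(weights * t.impurity[is_leaf]) / weights.sum())

# Hypothetical ordinally encoded feature (0, 1, 2) and binary target.
X = np.array([[0], [0], [1], [1], [2], [2]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = DecisionTreeClassifier(criterion="gini", max_depth=1, random_state=0).fit(X, y)

root_gini = gini(y)                 # impurity before any split
leaf_gini = weighted_leaf_gini(clf)  # impurity after one threshold split
print(root_gini, clf.tree_.impurity[0], leaf_gini)
```

A number like this is measured on the training data, so a deeper tree will usually drive it lower regardless of the encoder.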