Machine Learning Project Walk-Through With Python
Having individual pieces of knowledge, such as knowing what data is, how to explore it to extract useful information, and how basic machine learning models work, is valuable. But knowing how to combine all of these pieces to tackle a data science problem is what really matters, for you and for anyone else interested in working on open-ended problems.
This blog post walks through a complete machine learning project with a real-world dataset; in this project, we try to bring the important pieces of machine learning together.
Overview of this machine learning project:
- Download and import useful libraries, load and update the data
- Exploratory data analysis
- Data cleaning
- Baseline model prediction
- Feature engineering and selection
- Train, evaluate, and compare machine learning models
- What we observed and learned from this project as a beginner
- Conclusions and future work.
In this project we move step by step, implementing everything in Python using libraries such as opendatasets, matplotlib, seaborn, and sklearn. The complete project is available as a Jovian notebook, where it can be found here.
Before diving into the coding part of any machine learning project, first look at the problem and understand what needs to be solved and how to solve it. In this project, we work with the Sberbank Russian Housing Market dataset downloaded from Kaggle.
We can't jump into the open sea before we learn and practice swimming in a pond or a swimming pool. Just like that, if you start directly with the coding part without understanding the problem, it will seem overwhelmingly complex.
The dataset is a housing price dataset where the aim is to predict house sale prices. The target variable is called price_doc in train.csv. The training data covers August 2011 to June 2015, and the test data covers July 2015 to May 2016. The dataset also includes information about overall conditions in Russia's economy and finance sector, so you can focus on generating accurate price forecasts for individual properties without needing to second-guess what the business cycle will do. (External data was allowed if it obeyed the competition rules.)
- train.csv, test.csv: information about individual transactions. The rows are indexed by the “id” field, which refers to individual transactions (particular properties might appear more than once, in separate transactions). These files also include supplementary information about the local area of each property.
- macro.csv: data on Russia’s macroeconomy and financial sector (could be joined to the train and test sets on the “timestamp” column)
- sample_submission.csv: an example submission file in the correct format
- data_dictionary.txt: explanations of the fields available in the other data files
The objective is to use the provided individual property information to build a model that predicts the sale price, so the problem is a supervised regression machine learning task: because we have access to both the features and the target price, our aim is to train a model that can learn the relationship between the independent variables and the dependent variable.
Our goal is to develop a model that is both accurate, so that it predicts prices close to the true values (scored on approximately 65% of the test data), and understandable, so that we can interpret its predictions. Knowing the goal guides our decisions throughout the implementation process, from analyzing the data and gathering insights to building models.
Important: evaluation metrics. In this challenge, the evaluation metric was the root mean squared logarithmic error (RMSLE), chosen as the appropriate cost function. We also used the root mean squared error (RMSE) as a secondary evaluation metric.
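For reference, both metrics are easy to compute with numpy; here is a minimal sketch (the function names are ours, not from the competition code):

import numpy as np

def rmse(y_true, y_pred):
    # root mean squared error on the raw prices
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def rmsle(y_true, y_pred):
    # log1p avoids log(0); both arrays must be non-negative
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))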
1. Download and import useful libraries, load and update the data
We download and import useful libraries: opendatasets to fetch the data from Kaggle, pandas to read the CSV files, numpy to manipulate the data, and seaborn and matplotlib to plot graphs and explore and analyze the data, which helps us get insight. Some raw data corrections were provided in the BAD_ADDRESS_FIX.xlsx file, so after loading the data we first apply these fixes to obtain the updated information.
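A rough sketch of this step, assuming the competition files and the fix file have been downloaded; the paths and the update-by-id logic are illustrative:

import opendatasets as od
import pandas as pd

od.download('https://www.kaggle.com/c/sberbank-russian-housing-market')

data_dir = 'sberbank-russian-housing-market'
train = pd.read_csv(f'{data_dir}/train.csv', parse_dates=['timestamp'])
test = pd.read_csv(f'{data_dir}/test.csv', parse_dates=['timestamp'])
macro = pd.read_csv(f'{data_dir}/macro.csv', parse_dates=['timestamp'])
fix = pd.read_excel('BAD_ADDRESS_FIX.xlsx')

# Overwrite the affected rows with the corrected values, matching on "id"
train = train.set_index('id')
train.update(fix.set_index('id'))
train = train.reset_index()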
The training dataset has 30,471 entries (indexed 0 to 30,470) with 291 columns plus one target column named price_doc. The test set has 7,662 entries with the same 291 columns. The fix file contains 699 entries and 279 columns. The macro dataset has 2,484 entries and 100 columns.
After looking at the macro data, we wanted to understand its use, or what positive impact it could have on the model's predictions, so we went through each column name. Since the macro data describes Russia's macroeconomy and financial sector, it was provided as a set of features that could be important for property sale price prediction. A simple Google search on factors that influence house prices surfaced four major elements: location, size, age and condition, and demographics.
Besides all these aspects, property sale prices in Russia can also be influenced by:
1. the price of petrol and oil
2. the volume of all money transactions
3. the value of gold
4. USD and EUR exchange rates
We thought the macro data had columns that could impact the final result, so we decided to keep it, even though it contains numerous missing values that we would need to handle carefully so they would not harm our prediction model. We added the macro information to the train and test sets by merging on the timestamp column, obtaining new train and test datasets with 390 columns.
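A minimal sketch of the merge, assuming the dataframes loaded above:

# A left join keeps every transaction and attaches the macro indicators
# recorded on the same day.
train = train.merge(macro, on='timestamp', how='left')
test = test.merge(macro, on='timestamp', how='left')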
2. Exploratory data analysis (EDA)
EDA is a necessary step in any machine learning project. It's the step where we analyze the data using visualizations to understand the relationships within the data, find trends, detect anomalies, and gather insight.
Why EDA before Data Cleaning?
It really depends how comprehensible the data is and what your goals are. Sometimes visualizing the data helps you detect what to clean or preprocess; sometimes cleaning the data first gives you a better representation. Notice some clear gaps in a continuous variable? Maybe binning helps. A handful of categories (unformatted/erroneous/etc.)? Cleaning helps. These dudes go hand in hand. 'Benn Beckman'
In this project, we realized that compared to other real-world datasets, the Sberbank dataset has fewer entries but many columns. Before changing or modifying the data, we need to understand what seems abnormal and observe which columns may be important; in real life, elements such as total square meters, location, and age of the building strongly determine the cost of a house. We did the EDA to learn which variables need to be adjusted, cleaned, or given more attention. In this step, we observe:
1. Target columns distribution
2. Missing values
3. Relationship between numerical variables and target
4. Relationship between Categorical Variables and target
5. Level of cardinality of Categorical Variables
We check the maximum and minimum price per square meter and observe the distribution of our target column price_doc across all entries.
Doing EDA also means searching for relationships between the feature variables and the target.
Missing values: we plot the relationship between columns with missing values and the target (price_doc), using a bar plot to get a rough idea of how many missing values we have to deal with and the percentage they represent compared to non-missing values. We observe that many columns have missing values, and that columns containing the substring 'walk' have more missing values than non-null values.
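A small sketch of how such a bar plot can be produced with pandas and matplotlib (the number of columns shown is illustrative):

import matplotlib.pyplot as plt

missing_pct = train.isna().mean().mul(100)
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
missing_pct.head(30).plot.barh(figsize=(8, 10))   # 30 worst columns
plt.xlabel('% of missing values')
plt.show()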
a. Time-related variables: build_year, timestamp
One of the most deterministic elements of a house's sale price is its build year. Most of the time, new houses are more expensive, but old houses with a long history can also cost a fortune. After displaying the unique values of the build_year column, we realized that there are wrong values, probably caused by human typing errors. We plot the relationship between build_year and the target price_doc.
The type of visualization used is also an important element in exploring the data. We used histplot and scatter plots and observed that the scatter plot gave a better representation of the house build years. We then used a boxplot, where we can observe several outlier points above the maximum. The next time-related variable is the recording date, the timestamp column, where we can observe that 2014 was the year with the most recorded house data.
b. Discrete and continuous variables
Here we considered numeric columns with fewer than 20 unique values as discrete (106 columns) and those with more than 20 unique values as continuous (263 columns); a small sketch of this split follows.
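A minimal sketch of the 20-unique-value rule, assuming train is the merged training dataframe:

numeric_cols = train.select_dtypes(include='number').columns
discrete_cols = [c for c in numeric_cols if train[c].nunique() < 20]
continous_cols = [c for c in numeric_cols if train[c].nunique() >= 20]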
b.1 Discrete variables
We plot the relationship between each discrete column and the median price using bar plots. One very strange observation was in the num_room column, which holds the number of rooms: the price decreases drastically when the number of rooms exceeds 9. We also plot the correlations between the discrete columns and observe that columns with the suffix '_raion' and the prefixes 'cafe_' and 'market_' show some strong correlations compared to others.
b.2 Continuous variables
We used histograms to plot the relationship between the continuous variables and the target, where we observed that all data were collected from areas within 40 km of the Kremlin.
# Analyse the continuous variables by plotting a histogram of each
# (e.g. 'raion_population') to understand its distribution.
for col in continous_cols:
    df = train.copy()
    df[col].hist(bins=50)
    plt.title(col); plt.show()
We also use scatter plots to display the same relationship after taking the log of the target price, which helps describe the intervals where most recorded data points land.
# Let's see the same relationship with the log applied, e.g. 'raion_population'.
# Columns containing zeros are left on the raw scale, since log(0) is undefined.
for col in continous_cols:
    df = train.copy()
    if 0 not in df[col].unique():
        df[col] = np.log(df[col])
    df["price_doc"] = np.log(df["price_doc"])
    df.plot.scatter(x=col, y="price_doc", alpha=0.3)
    plt.show()
We used boxplots to detect outliers in each continuous column, where we observed that columns with the suffix '_km' and the prefix 'cafe_' present a number of outliers that need to be looked at closely to find the best way to deal with them.
c. Categorical variables
Here we plot the relationship between the categorical variables (18 columns) and the target. The variable sub_area is the only one with a cardinality level above 5. We can observe the sub_area values with the highest prices, and the median price for Investment properties seems to be higher than for OwnerOccupier ones. Other plots explore house price tendencies across the recorded years and months, and the distribution based on state.
As the final part of our EDA, we used the pairplot, a great tool from the seaborn library that lets us see the relationships between multiple pairs of variables as well as the distributions of single variables. We plot the relationships between the variables we think could be decisive for the house price; the observation here is a positive correlation between each of those variables.
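A sketch of such a pairplot; the column choice here is illustrative, not the exact set we used:

import seaborn as sns

cols = ['full_sq', 'life_sq', 'num_room', 'price_doc']
sample = train[cols].dropna()
# plot a random subsample to keep the pairplot readable and fast
sns.pairplot(sample.sample(min(len(sample), 2000), random_state=42))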
3. Data cleaning
Some ML projects require less cleaning work; this dataset is on the messy side and needs serious cleaning. Some of the column names are easy to guess or understand, e.g. full_sq, floor, whereas other column names are difficult to interpret.
In some tasks we can build an accurate model without knowing what the variables mean, but knowing and understanding some variables (columns) is very important and helpful.
To get a rough idea of property sale prices in Moscow, we spent a few minutes online searching for related information. From this picture we observed that prices range between 180 and 700 thousand rubles per square meter, and that the average monthly net salary after tax is 76,471 rubles. Such information is helpful during cleaning for setting limits when modifying the data. In this step we focus on the variables we consider important; data cleaning was applied to both the train and test sets.
3.1 Build year (build_year)
We first display the column description to get basic statistics, where we observe that 75% of the houses were built before 2012 and the maximum build year is 2019, while the transactions were recorded between 2011 and 2016.
We considered build years later than 2016 to indicate houses under construction. Incorrect inputs here are likely human typing mistakes, as some build years were registered as nan, 0, 1, 2, 3, 20, 71, 215, or 20052009. Knowing that years normally have 4 digits, we replaced 20 with 2000, 71 with 1971, 215 with 2015, and 20052009 with 2005. Since most houses were built between 1800 and 2019, we treated 1691 as 1961. A sketch of these repairs follows.
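A sketch of these repairs with pandas (replacements and the plausible range are taken from the text above; numpy is assumed imported as np):

year_fixes = {20: 2000, 71: 1971, 215: 2015, 20052009: 2005, 1691: 1961}
for df in (train, test):
    df['build_year'] = df['build_year'].replace(year_fixes)
    # remaining implausible years (0, 1, 2, 3, ...) become unknown
    bad = (df['build_year'] < 1800) | (df['build_year'] > 2019)
    df.loc[bad, 'build_year'] = np.nan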
3.2 Full square meter & Life square meter (full_sq, life_sq)
These two variables are related to each other: we assign nan to entries where the life square meters exceed the full square meters. Life square meters below 5 and full square meters below 5 were also set to nan.
Several entries showed anomalies in these two variables, detected after checking the data several times. For example, a house with full_sq=403, life_sq=1.0 and num_room=2 seems abnormal in real life, so we set both full_sq and life_sq to nan.
When full_sq > 210 and the ratio life_sq/full_sq is below 0.3, we assign nan to full_sq; this simply means that if the living area takes up less than 30% of a total area larger than 210 square meters, something might be abnormal. A sketch of these rules follows.
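A sketch of these rules (thresholds taken from the text above):

for df in (train, test):
    df.loc[df['life_sq'] > df['full_sq'], 'life_sq'] = np.nan
    df.loc[df['life_sq'] < 5, 'life_sq'] = np.nan
    df.loc[df['full_sq'] < 5, 'full_sq'] = np.nan
    # living area under 30% of a total area above 210 sq m looks abnormal
    ratio = df['life_sq'] / df['full_sq']
    df.loc[(df['full_sq'] > 210) & (ratio < 0.3), 'full_sq'] = np.nan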
3.3 Kitchen square meters (kitch_sq) & Number of rooms (num_room)
Here we assign nan to every kitch_sq above 500 square meters, and to any kitch_sq greater than the difference between full_sq and life_sq; kitch_sq should be less than life_sq, which automatically means less than full_sq.
We assign nan to all entries with num_room equal to zero, and we consider houses with life_sq above 150 to have at least 2 rooms.
3.4 Maximum floor & Floor (max_floor, floor)
Before modifying any values here, we searched for information about the tallest building in Moscow, which seems to have 101 floors but is in a different location than the ones recorded. For this dataset we capped max_floor at 57 and made sure that max_floor ≥ floor.
3.5 Others: State, Ecology, Product_type
We noticed that the dataset contains 5 unique states, with the fifth (33) having only one entry; since the others range from 1 to 4, this is probably a typing error, so we treated it as an outlier and assigned it nan.
Ecology is an ordinal categorical column; to better reflect this order, we replaced its contents with digits in increasing order.
We assigned the value 'Investment' to all houses whose product_type was nan, selected data within a reasonable range based on the price per square meter, and slightly reduced the price of Investment houses built after 2013.
3.6 Drop less important columns.
In this step, all columns with a share of missing values above our fixed threshold (80%), and columns with only one unique value, were dropped, as we will not learn anything about the target from such columns. Columns with very low variance can likewise be dropped, because they don't help in predicting the target variable. A sketch of this pruning follows.
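A sketch of this pruning (the 80% threshold is from the text; the same columns are removed from the test set):

threshold = 0.80
mostly_missing = train.columns[train.isna().mean() > threshold]
single_valued = [c for c in train.columns if train[c].nunique(dropna=False) <= 1]
to_drop = set(mostly_missing) | set(single_valued)
train = train.drop(columns=to_drop)
test = test.drop(columns=[c for c in to_drop if c in test.columns])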
After cleaning the data, we split the training data into two sets (train, valid) with respect to datetime, the validation set representing 20% of the original training set, as sketched below.
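A minimal sketch of the chronological split:

# Sort by date and hold out the most recent 20% as the validation set,
# mirroring the train/test split of the competition.
train = train.sort_values('timestamp')
split = int(len(train) * 0.8)
train_df, valid_df = train.iloc[:split], train.iloc[split:]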
4. Baseline model training and first observation
Here we first train a naive baseline model that returns the median value of the target; it exists essentially so that we can compare other models' results against it. Another machine learning model is worth using only if it can beat this naive baseline; if not, we should try different approaches.
Training the baseline model requires less expertise and less time and is less complex. The objective is not to get the best accuracy but to help us understand the task: how far we are from our goal, approximately how much effort is needed to achieve higher accuracy, which type of more complex model we might use, and, to a certain extent, the data itself. A sketch of the baseline follows.
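A sketch of this baseline, reusing the rmsle helper sketched earlier:

median_price = train_df['price_doc'].median()
baseline_pred = np.full(len(valid_df), median_price)
print('baseline RMSLE:', rmsle(valid_df['price_doc'].values, baseline_pred))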
We trained Linear and Ridge regression models, which raised an error when estimating the root mean squared log error because both predictions contained negative values. We also trained tree-based models such as Decision Tree, Random Forest, ExtraTrees, XGBoost, LightGBM, and CatBoost, with LightGBM having a slightly better prediction accuracy, which led us to the top 80% on the private leaderboard. We took the best 4 models and trained them a second time with 267 columns (those with a positive correlation with the target, plus the categorical variables) and compared their accuracy.
Both LightGBM and CatBoost led us to the top 71%. Compared to the others, LightGBM seems robust and keeps the maximum information related to the target even though we limited the number of variables. We decided to focus on these two models to achieve better accuracy.
We already have an idea of which models could lead us to a better result, but what about the best feature set? Which features contain relevant information? This is crucial, as the model will learn from these features to give better predictions.
5. Feature Engineering, data preparation and feature selection
Feature engineering and selection are very important steps in the machine learning pipeline. This step is crucial: in the previous step we not only gained intuition about which model to use, but also saw the advantage of selecting a good feature set, which can achieve a better result.
5.1 Feature engineering
It's the process of extracting, transforming, and creating new feature variables from existing ones. We can simply consider feature engineering as creating additional features from the raw data we have. In this step we used external data such as the longitude and latitude of houses, extracted datetime-related features, and transformed several features into new ones by combining, adding, subtracting, or taking ratios.
The transformations and creation of new features were inspired by previous tasks and other works, including ideas shared in the challenge discussion forum. The goal is to obtain the maximum number of features carrying important information.
Store everything, but don't try to cook with all of it at once (be selective). ''Cassie Kozyrkov''
a. Add external data such as house longitude and latitude.
b. Group sub_area into district.
c. Extract date related features.
d. Fusion or combination of features
e. Ratio among features
f. Difference within features
g. Multiply features
We obtained a total of 440 features after feature engineering; a small sketch of these transforms follows.
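A sketch of a few such transforms; the derived column names are illustrative, not the exact ones we created:

for df in (train, test):
    df['sale_year'] = df['timestamp'].dt.year                 # date feature
    df['sale_month'] = df['timestamp'].dt.month               # date feature
    df['age_at_sale'] = df['sale_year'] - df['build_year']    # difference
    df['life_full_ratio'] = df['life_sq'] / df['full_sq']     # ratio
    df['extra_sq'] = df['full_sq'] - df['life_sq']            # difference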
h. PCA feature reduction
We applied the principal component analysis algorithm with 6 components to 82 features that seemed less important. After using PCA to project these 82 features into 6 dimensions with no particular meaning assigned to each (pca_0, …, pca_5), we obtained a dataset with the total number of features reduced to 364. A sketch follows.
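A sketch of this reduction with scikit-learn, assuming less_important_cols holds the 82 column names; in practice the same fitted PCA must also be applied to the test set:

from sklearn.decomposition import PCA

pca = PCA(n_components=6)
components = pca.fit_transform(train[less_important_cols].fillna(0))
for i in range(6):
    train[f'pca_{i}'] = components[:, i]
train = train.drop(columns=less_important_cols)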
i. Feature correlation
Here we compute the correlation between the numerical features and the target column, select only the numerical columns with a correlation greater than zero, and obtain 214 features out of 343. These 214 numerical features plus 20 categorical features remained, for a total of 234 features considered as the new feature set; a sketch of the selection follows.
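A minimal sketch of the correlation filter:

numeric_cols = train.select_dtypes(include='number').columns.drop('price_doc')
corr = train[numeric_cols].corrwith(train['price_doc'])
selected = corr[corr > 0].index.tolist()   # keep positively correlated columns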
Remove outliers based on real-life analysis: after grouping sub_area into districts, we labeled the districts where the cost of living seems high (the city center and its surrounding area) as expensive_district. Based on the picture in the data cleaning step, which shows that the city center's minimum price per square meter is 315,000 rubles, we decided to drop all entries recorded from 2014 onward in districts not considered expensive_district where the price per square meter was above 315,000 rubles. This eliminated around 168 entries, most of them outliers.
5.2 Data preparation
In this project, we use the median strategy to impute numerical columns and the most-frequent strategy for categorical columns. StandardScaler was used to standardize the numeric data (zero mean, unit variance), and LabelEncoder was used to encode the categorical variables. A sketch of this preparation follows.
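A sketch of this preparation with scikit-learn; numeric_cols and cat_cols are assumed defined, and the validation and test inputs should go through transform only, using the objects fitted here:

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, LabelEncoder

num_imputer = SimpleImputer(strategy='median')
train_inputs[numeric_cols] = num_imputer.fit_transform(train_inputs[numeric_cols])

scaler = StandardScaler()
train_inputs[numeric_cols] = scaler.fit_transform(train_inputs[numeric_cols])

cat_imputer = SimpleImputer(strategy='most_frequent')
train_inputs[cat_cols] = cat_imputer.fit_transform(train_inputs[cat_cols])

for col in cat_cols:
    le = LabelEncoder()
    train_inputs[col] = le.fit_transform(train_inputs[col].astype(str))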
5.3 Feature Selection
The aim of feature selection is to choose the most relevant features in the data; in feature selection, we look for the set of features with the most positive impact on the predictions.
A machine learning model needs data to learn from. Getting the right data, including relevant information, is crucial for our task, as it can help the model generalize better to new data and produce a more interpretable model. 'If you put garbage in, you get garbage out.'
In this project we use the rfpimp feature selection algorithm, a library that provides feature importances based on the permutation importance strategy for general scikit-learn models, with implementations specifically for random forests.
# Keep only features with positive permutation importance.
feature_importance = feat_importance(train_inputs, train_targets)
column = train_inputs.columns[feature_importance.Importance > 0]
X_train = train_inputs[column].copy()
X_valid = valid_inputs[column].copy()
X_test = test_inputs[column].copy()
After experimenting with different values of the n_estimators hyperparameter and looking for which set of features performed best, the most accurate set was 152 feature columns, obtained when taking the log of our target values. The feature selection algorithm was trained using only our training set.
6. Train and evaluate
After completing the first 5 steps, and before starting with modeling, which is the final step, we defined a helper function to print evaluation metrics such as the root mean squared log error (RMSLE), mean absolute error (MAE), average mean absolute error, and score, all from the sklearn library. There are a bunch of metrics for regression problems; nevertheless, it's always good to pick a single metric to evaluate models. RMSLE was the one recommended for this challenge. A sketch of such a helper follows.
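A sketch of such a helper; the function name is ours, and it assumes the predictions and targets are on the same (raw price) scale:

from sklearn.metrics import mean_squared_log_error, mean_absolute_error

def evaluate(model, inputs, targets):
    preds = model.predict(inputs)
    rmsle = np.sqrt(mean_squared_log_error(targets, preds))
    mae = mean_absolute_error(targets, preds)
    print(f'RMSLE: {rmsle:.5f}   MAE: {mae:,.0f}')
    return preds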
As usual, during training we provide the training set's feature columns to the model along with the target; the purpose is for the model to learn a mapping between the independent variables (feature columns) and the dependent variable (target). The validation set is used to check how the model behaves on new observations, data it has not seen or learned from; it also helps during hyperparameter tuning. The test set is used to evaluate the trained model; it has no target values, so there are no answers. Predictions are made using only the features, and the accuracy is obtained after submission to the challenge page, where our predictions are compared with the answers.
Here we decided to focus on the models that showed promising results during the baseline training before feature engineering and selection: LightGBM and CatBoost.
The two models were trained with default hyperparameters, as sketched below; our best score on the public and private leaderboards was obtained using CatBoost, which landed in the top 27%. This result was achieved after several submissions. While the models' performance seemed reasonable during evaluation on the training and validation sets, the test result was not as good as we expected.
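A sketch of the two final models with default hyperparameters, fit on the log of the target as described in the feature selection step; variable names follow the earlier snippets:

from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

lgbm = LGBMRegressor(random_state=42)
lgbm.fit(X_train, np.log1p(train_targets))

cat = CatBoostRegressor(random_seed=42, verbose=0)
cat.fit(X_train, np.log1p(train_targets))

# predictions are mapped back to rubles before submission
test_preds = np.expm1(cat.predict(X_test))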
7. What did we learn from this project as a beginner?
The Sberbank challenge is an old challenge that we decided to work on to gain basic skills. It was a long journey during which we learned a lot:
After several submissions and trying different feature sets (feature engineering), we achieved our best score of 0.32010 on the private leaderboard and couldn't beat that score for a while. StandardScaler worked better here, which is not always the case; sometimes MinMaxScaler can be the better option, but here the difference is very thin. We later improved to 0.31928 after deciding to use LabelEncoding for the categorical variables. The added external features (longitude and latitude) also had a good impact on our predictions. We then reached 0.31877 (top 30% on the private leaderboard) by using the median instead of the mean as the imputation strategy.
'It seems that, when dealing with outliers, the median is the better option, as the median will still be reasonable whereas the mean will change drastically.'
0.31793 (top 27% on the private leaderboard) was achieved by removing some data (outliers) based on two major elements:
- the picture from the data cleaning stage, where the price per square meter in the city center should be at least 315,000 rubles;
- our predicted values being more than 2.5 times higher than the real values for houses mostly recorded from 2014 onward.
Important tip for beginners: always pay attention to the challenge discussion forum, especially topics with a high number of upvotes; there is a bunch of information there related to the dataset, the features, and even model behavior. We can get more insight by checking what people shared and discussed about the task and apply it successfully to similar upcoming tasks.
8. Conclusions and Future work
In this blog post, we walked through the important steps of a machine learning project. We used a real-world dataset that had a huge number of data errors; compared to other similar projects on house sale price estimation, this data has many more feature columns with fewer entries. After describing the task we:
a. Performed an exploratory data analysis, where we learned a lot about the dataset.
b. Cleaned and fixed the raw data, focusing on the variables we observed to be important.
c. Transformed and created new features, some of which were selected into the best feature set to train our models.
d. Trained a baseline model to compare against more complex models.
Our current result is still far from our expectations. As next steps, we will perform hyperparameter tuning to optimize the models; create and train a different model for each product_type category (OwnerOccupier, Investment); and check for outliers, because for several entries our best model's predicted price was 2 to 3 times higher than the target price, mostly for entries recorded between 2014 and 2015.