
The objective of this blog is to analyse two large datasets of differing modalities (image and tabular) for average crop yield prediction. More specifically, the image dataset consists of remote-sensing, satellite infrared images of soybean crops, while crop management and weather readings make up the tabular dataset. The goal is to use these historical readings and their associated average yield to train a regression model that infers average yield from future readings. Accurate crop yield prediction is vital for timely and cost-effective export, and so vastly impacts global food production. The dataset can be downloaded from this **link**. These datasets are based on active research directions investigating methods to infer supply and demand effectively and securely for the global food supply chain. The tabular dataset includes the average yield and 12 features, which are defined below. The image dataset is comprised of histograms extracted from the infrared satellite images, encoding a temporal component for computational convenience. The task is to develop a set of regression models to automatically predict the average soybean yield given the crop management features and infrared satellite images of the crops themselves. *No prior knowledge of the domain problem is needed or assumed to fulfil the requirements of this problem statement.*

**Task 1: Linear models for regression**

The problem we aim to tackle has been described and defined above. This task aims to develop and understand linear regression models and the differences between them, perform a thorough hyperparameter search to leverage each model effectively, and analyse the trained models to identify which features are the main influencing factors for crop yield prediction.

**Data:** The dataset ‘soybean_tabular.csv’ contains the average soybean yield for a given location within the US corn belt, together with the corresponding crop management and weather information for a specific year. The purpose of this experiment is to explore the relationship between crop lifecycle information and the yield produced under those conditions. The dataset comprises 506 records, where each record contains 12 features and the average crop yield. These are defined below:

- Variety: Crop seed variety.
- 4x Soil Components: {S_1, S_2, S_3, S_4}
- 3x Crop Management: {M_1, M_2, M_3}
- 4x Weather Components: {W_1, W_2, W_3, W_4}
- Yield: Average soybean yield.

*Unit of measurement or range of values of each feature are not relevant. However, features can be at different scales and/or measured in different units.*

**Task 1.1 — Importing the Data[1].**

For importing the data I have used the read_csv function of the pandas library, as shown in the code snippet below.

After importing pandas, let’s read the CSV file and store it in a variable; the output of read_csv is a DataFrame. I am also checking for any null values in the DataFrame so that I can preprocess the data accordingly. **(FYI — while reading the CSV file, just provide the location where it is stored.) Also, the images are 9-dimensional, so I have converted them to vectors and stored them in a CSV file, which can be found in this ****link****.**
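The import and null check can be sketched as below. Since the real ‘soybean_tabular.csv’ may not be on your machine, this sketch falls back to a small synthetic frame with the same column names so it runs anywhere; only the read_csv call and the isnull check are the actual steps described above.

```python
import numpy as np
import pandas as pd

# Read the dataset; fall back to a synthetic stand-in when the file is absent.
try:
    df = pd.read_csv("soybean_tabular.csv")
except FileNotFoundError:
    rng = np.random.default_rng(0)
    cols = (["Variety"] + [f"S_{i}" for i in range(1, 5)]
            + [f"M_{i}" for i in range(1, 4)] + [f"W_{i}" for i in range(1, 5)])
    df = pd.DataFrame(rng.random((503, 12)), columns=cols)
    df["Yield"] = rng.random(503) * 50

print(df.shape)                  # rows x 13 columns (12 features + Yield)
print(df.isnull().sum().sum())   # total count of missing values
```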

So, we have 503 rows and 13 columns and there are no null values present in the dataframe.

**Task 1.2 — Summarizing the data[1].**

For summarisation of the data, I have used the ‘describe’ method of the DataFrame. The ‘describe’ method generates descriptive statistics, which include the count, mean, std, min, 25%, 50%, 75%, and max value of each column, excluding NaN values. The following code snippet shows the use of the describe method.
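A minimal sketch of the describe step, using a small synthetic frame in place of the real dataset:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the soybean DataFrame.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((503, 4)), columns=["S_1", "S_2", "W_1", "Yield"])

# count, mean, std, min, 25%, 50%, 75%, max per numeric column
summary = df.describe()
print(summary)
```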

The ‘describe’ method of pandas doesn’t report the median or the range directly, so I have used pandas’ median() as shown below. For the range, I subtracted each column’s minimum value from its maximum value, as shown in the code snippet below[2].
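The median and range computations can be sketched as follows (again on a synthetic stand-in frame):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((503, 3)) * 10, columns=["S_1", "W_1", "Yield"])

medians = df.median()          # per-column median
ranges = df.max() - df.min()   # per-column range: max minus min
print(medians)
print(ranges)
```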

**Task 1.3 — Checking for the skewness of features[3].**

I have also checked the skewness of the features. It’s not a must-have check, but I wanted to apply a log transform to features that have many outliers and a positively skewed distribution. The code snippet below shows how I visualise the skewness of the columns using the seaborn library.

So, the first code snippet shows a box plot of the yield column and the second shows its distribution as a bar graph. Both snippets produce the following graphs.

I will deal with the skewness in the next sub-part of the problem. I have decided not to touch the outliers and will leave them as they are, since similar values may well appear in the test data too.

**Sub Task 2.1 — Splitting the data[4].**

For splitting up the data, I have used the train_test_split function of sklearn.model_selection to divide the data into 60% training data, 20% validation data, and 20% test data. Now, why do we split the data at all? The motivation is quite simple — the data is split into train, validation, and test sets to detect overfitting and to evaluate the model fairly. Overfitting is when the model fits the training data so closely that it fails to generalise to unseen data. I have kept 60 percent of the data for training so that the model is not starved of data (underfitting), and enough data in the validation and test sets for reliable evaluation. The following code snippet shows how to split the dataset –
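A minimal sketch of the 60/20/20 split, chaining two train_test_split calls (synthetic X and y stand in for the real features and yield):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.random((503, 12)))
y = pd.Series(rng.random(503))

# First split off 40%, then halve that 40% into validation and test.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # roughly 60% / 20% / 20%
```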

**Task 2.1 — Removing the Skewness from Features[3].**

As discussed in task 1.3, some features in the data have positive skewness, and I determined this from the following code snippet.
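The skewness check amounts to calling DataFrame.skew(), sketched below on synthetic columns (one deliberately right-skewed, one roughly symmetric):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "S_1": rng.lognormal(0.0, 0.6, 503),  # right-skewed stand-in feature
    "W_1": rng.normal(0.0, 1.0, 503),     # roughly symmetric stand-in feature
})

# Positive values flag right-skewed features.
skew = df.skew().sort_values(ascending=False)
print(skew)
```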

As you can see, some features have positive skew values. I am going to handle this positive skewness and bring the features onto a comparable scale. To remove the skewness, I have used the log function of the numpy library to log-transform the feature values. Let’s see how y_train looks after the log transformation.

Before log transform.

The plot above shows how unevenly the data is distributed; let’s see how this changes after the log transformation.

Let’s also check the probability plot after log transformation,

**Task 2.3 — Standard Scaling the data[3]:**

The final step in preprocessing is standardisation. It is important to bring all predictors onto a common scale, otherwise columns with larger magnitudes can dominate the others. Here, standard scaling is used: each feature is centred and scaled by its standard deviation. This is an important step before applying Ridge and Lasso regression. The following code snippet shows how I applied standard scaling to X_train, X_test, and X_val.
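A sketch of the scaling step with sklearn’s StandardScaler; note the scaler is fitted on the training set only, then reused on the validation and test sets (synthetic arrays stand in for the real splits):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.random((300, 12)) * 100
X_val = rng.random((100, 12)) * 100
X_test = rng.random((103, 12)) * 100

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)  # fit statistics on train only
X_val_s = scaler.transform(X_val)          # reuse the train statistics
X_test_s = scaler.transform(X_test)

print(X_train_s.mean(axis=0).round(6))     # ~0 per column after scaling
print(X_train_s.std(axis=0).round(6))      # ~1 per column after scaling
```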

Linear regression training: there are numerous regression models available, yet for this task I chose the following linear models to implement, describe, and evaluate:

- Ridge Regression
- Lasso Regression

Before implementing the model let’s check what are Ridge and Lasso regression.

**Ridge Regression [5][7]–**

In machine learning, if we want to make good predictions, we cannot judge a model only on the data it has already seen (the training data), yet developers’ attention is often limited to fitting the model. Ridge allows us to regularise (shrink) the coefficient estimates of linear regression: the estimated coefficients are pushed towards 0 so that the model predicts well on new data. This lets us use flexible models while avoiding overfitting. Linear regression is a solid estimator, but ridge can achieve a lower MSE than a plain linear model. The model is the same as linear regression, but in ridge regression the cost function is altered by adding a penalty equal to the square of the magnitude of the coefficients.
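In standard notation (this equation is supplied here for reference; it did not survive in the original post), the ridge cost function is:

```latex
J_{\text{ridge}}(w) = \sum_{i=1}^{n}\Big(y_i - w_0 - \sum_{j=1}^{p} w_j x_{ij}\Big)^{2} + \lambda \sum_{j=1}^{p} w_j^{2}
```

The first term is the usual least-squares error; the second is the l2 penalty, with lambda controlling the amount of shrinkage.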

So, ridge regression puts a constraint on the coefficients (w). The penalty term lambda regularises the coefficients: if they take large values, the optimisation function is penalised. Ridge regression shrinks the coefficients, which helps reduce model complexity and handle multicollinearity; it is most useful when the features are multicollinear.

**Lasso Regression [6][7]–**

Lasso regression is the same as ridge regression with a slight but important difference. The cost function of Lasso (Least Absolute Shrinkage and Selection Operator) regression can be written as –
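In standard notation (supplied here, not from the original post), the lasso cost replaces the squared penalty with absolute values:

```latex
J_{\text{lasso}}(w) = \sum_{i=1}^{n}\Big(y_i - w_0 - \sum_{j=1}^{p} w_j x_{ij}\Big)^{2} + \lambda \sum_{j=1}^{p} |w_j|
```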

Compared with the ridge cost function, the only difference is that the absolute values (magnitudes) of the coefficients are penalised instead of their squares. This type of regularisation can drive some coefficients exactly to zero, i.e. some features are completely ignored when producing the output. So Lasso not only helps against overfitting but also performs feature selection.

**Ridge Vs Lasso [6][7]–**

Both minimise the least-squares error function, but Lasso also helps with feature selection.

The diagram shows how Lasso can reduce the dimension of the feature space. The blue and green areas are the constraint regions and the red ellipses are the contours of the error. Both methods determine the coefficients at the first point where the elliptical contours hit the constraint region. The diamond (Lasso) has corners on the axes, unlike the disk (Ridge), and whenever the elliptical contour hits such a corner, one of the coefficients vanishes completely[6].

**Task 3.1 — Ridge and Lasso regression Implementation[9].**

I have used the sklearn library to implement ridge and lasso regression, and the metrics module of sklearn to calculate the evaluation metrics. The code snippet below shows this –

In that snippet, I implemented a method that takes the models as parameters, along with a boolean flag indicating whether a pipeline is used. I then pass both models, i.e. Ridge and Lasso, together with the alpha values.

GridSearchCV performs k-fold cross-validation by default (k=5 in recent scikit-learn versions; older versions used k=3). Here, I am only passing the alpha list as part of the experiment.
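The evaluation method and grid search can be sketched as below. The helper name `evaluate` and the synthetic data from make_regression are assumptions for illustration; only the GridSearchCV-over-alphas pattern mirrors the description above, and scores on synthetic data will of course differ from the post’s.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the scaled soybean features and yield.
X, y = make_regression(n_samples=503, n_features=12, noise=10.0, random_state=0)
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)

def evaluate(model, alphas):
    """Grid-search alpha, then report R^2 on train, test and validation."""
    search = GridSearchCV(model, {"alpha": alphas}, cv=3)
    search.fit(X_train, y_train)
    best = search.best_estimator_
    return (search.best_params_["alpha"],
            r2_score(y_train, best.predict(X_train)),
            r2_score(y_test, best.predict(X_test)),
            r2_score(y_val, best.predict(X_val)))

alphas = [0.02, 0.1, 1, 10]
for name, model in [("Ridge", Ridge()), ("Lasso", Lasso())]:
    alpha, r2_tr, r2_te, r2_va = evaluate(model, alphas)
    print(f"{name}: alpha={alpha}, train={r2_tr:.2f}, test={r2_te:.2f}, val={r2_va:.2f}")
```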

The method reports the scores for the train, test, and validation sets. As you can see in the output, Lasso gives an R_2 score of 0.72 on the training dataset, 0.73 on the test dataset, and 0.66 on the validation dataset, whereas Ridge gives an R_2 score of 0.75 on the training dataset, 0.77 on the test dataset, and 0.69 on the validation dataset. The ridge model performs slightly better on the test and validation datasets.

The best alpha value for Lasso is 0.02 and for Ridge is 10. So the best performing model here is Ridge, and we will perform our hyperparameter tuning on the Ridge model.

**Sub Task 3.2 — Hyperparameter Tuning.**

I have also included Lasso in the hyperparameter tuning just for experimentation, but our best performing model is Ridge, as discussed in the previous section. Now, to improve the R_2 score on the validation dataset, I used a Pipeline from the sklearn.pipeline module along with PolynomialFeatures from sklearn.preprocessing to check for any improvement. I also used numpy’s logspace to generate candidate alphas for both ridge and lasso, and passed them as parameters via GridSearchCV from sklearn.model_selection. A pipeline chains the transformers and the estimator into one object. In the code below I combine a polynomial feature step with Ridge and Lasso regression. Since the model’s performance depends on the hyperparameter values, we need to try many candidate values to find the optimum; doing this manually would take considerable time and resources, so we use GridSearchCV to automate the tuning. The code snippet below shows how I implemented the GridSearchCV for the model.
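A sketch of the pipeline-plus-grid-search step on synthetic data (the degree, alpha range, and step names are illustrative choices, not taken from the original post):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the soybean features and yield.
X, y = make_regression(n_samples=300, n_features=12, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipe = Pipeline([
    ("poly", PolynomialFeatures(degree=2)),  # expand features with interactions
    ("scale", StandardScaler()),
    ("model", Ridge()),
])

# np.logspace spreads alpha candidates over several orders of magnitude;
# the grid also swaps the final estimator between Ridge and Lasso.
params = {"model": [Ridge(), Lasso(max_iter=10000)],
          "model__alpha": np.logspace(-2, 2, 5)}
search = GridSearchCV(pipe, params, cv=3)
search.fit(X_train, y_train)

print("best model:", search.best_params_["model"])
print("best alpha:", search.best_params_["model__alpha"])
print("cross-validated R^2:", search.best_score_)
```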

The output is as shown below –

As you can see in the output, Ridge now gives an R_2 score of 0.89 on the training dataset, 0.83 on the test dataset, and 0.76 on the validation dataset, whereas Lasso gives 0.86, 0.80, and 0.73 respectively. The ridge model again performs better on the test and validation datasets. The best alpha value for Lasso is 0.0076 and for Ridge is 18.3073.

**Feature importance:**

Again, using the best performing model, we infer the features most influencing the average yield from the ridge or lasso coefficients, explain how such coefficients give a measure of importance, and evaluate the results, indicating why these are the most important crop management features.

Right now there are 12 predictors and 1 target variable, and we need to decrease the number of predictors. There are two options:

- Manually deleting the fields with low correlation to the target[5].
- Using a regularisation technique to pick only the most useful predictors[8].

So, let’s first try deleting the less correlated features manually with the help of a heatmap. For plotting the heatmap I have used the matplotlib and seaborn libraries to inspect the correlated features. The code snippet below shows the implementation.
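The heatmap step can be sketched as follows; a synthetic frame with one feature deliberately correlated to the target stands in for the real data.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((503, 4)), columns=["S_1", "S_2", "W_1", "Yield"])
df["Yield"] = df["S_1"] * 2 + rng.normal(0, 0.1, 503)  # S_1 drives Yield

# Pairwise correlation matrix rendered as an annotated heatmap.
fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(df.corr(), annot=True, cmap="coolwarm", ax=ax)
fig.savefig("correlation_heatmap.png")
```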

To find the less correlated features, I use the DataFrame’s corr() method. Features with a correlation above 0.4 with the target are kept, and the rest are removed.
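The thresholding step can be sketched as below (the 0.4 cutoff is the one stated above; the synthetic frame is a stand-in):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((503, 4)), columns=["S_1", "S_2", "W_1", "Yield"])
df["Yield"] = df["S_1"] * 2 + rng.normal(0, 0.1, 503)  # S_1 drives Yield

# Absolute correlation of each feature with the target, cut at 0.4.
corr_with_yield = df.corr()["Yield"].drop("Yield").abs()
kept = corr_with_yield[corr_with_yield > 0.4].index.tolist()
print("kept features:", kept)
```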

So, the features above have the highest correlation, and I will keep them for model training. For the second option, I use ridge-style (l2) regularisation for feature selection, since Ridge was the best performing model. In the code snippet below, I use LogisticRegression from sklearn, penalising the model by passing a “C” value of 0.2 and penalty type “l2”, which corresponds to ridge-style regularisation. Before fitting, I standard-scale X_train and label-encode y_train, because sklearn’s SelectFromModel does not accept the target without encoding here; SelectFromModel will then keep, in theory, the features with non-zero coefficients.
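A sketch of this selection step. The synthetic two-class target is an assumption made so the snippet runs standalone (the post label-encodes the yield values instead); the scaling, LabelEncoder, and SelectFromModel-around-LogisticRegression pattern follows the description above.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder, StandardScaler

rng = np.random.default_rng(0)
X_train = rng.random((300, 12))
# Discrete stand-in target driven by features 1 and 3.
y_raw = np.where(X_train[:, 1] + X_train[:, 3] > 1.0, "high", "low")

X_scaled = StandardScaler().fit_transform(X_train)
y_train = LabelEncoder().fit_transform(y_raw)

# l2 penalty with C=0.2 mirrors ridge-style shrinkage on the coefficients;
# SelectFromModel keeps features whose coefficients exceed its threshold.
selector = SelectFromModel(LogisticRegression(C=0.2, penalty="l2", max_iter=1000))
selector.fit(X_scaled, y_train)

kept = selector.get_support()
print("number of features kept:", kept.sum())
```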

Now let’s visualise the features kept by the regularised model and print the total number of features we have.

As seen in the output, the features S_2, S_4, M_3, W_1, W_2, and W_3 are selected and the rest are dropped. As the model penalises the coefficients, small ones fall below the selection threshold; remember that ridge (l2) regularisation does not shrink coefficients exactly to zero, so SelectFromModel drops those below its threshold instead. Also keep in mind that increasing the C value resulted in removing more features, so after some runs I concluded to keep the C value at 0.2.

After this, we can apply the logistic regression with the features extracted above. Check this link for the code file.

- Pandas, ‘https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html’.
- Stackoverflow (Range of values), ‘https://stackoverflow.com/questions/24748848/pandas-find-the-maximum-range-in-all-the-columns-of-dataframe’.
- Ridge Regression, ‘https://towardsdatascience.com/the-power-of-ridge-regression-4281852a64d6’.
- Sklearn Train Test and Validation split, ‘https://stackoverflow.com/questions/38250710/how-to-split-data-into-3-sets-train-validation-and-test’.
- Ridge Regression, ‘https://stats.stackexchange.com/questions/486281/if-there-any-benefit-to-using-ridge-regression-in-a-simple-linear-regression-pro’.
- Lasso Regression, ‘https://towardsdatascience.com/ridge-and-lasso-regression-a-complete-guide-with-python-scikit-learn-e20e34bcbf0b’.
- Feature selection, ‘https://towardsdatascience.com/feature-selection-using-regularisation-a3678b71e499’.
- Hyperparameter tuning, ‘https://alfurka.github.io/2018-11-18-grid-search/’.
