New Scikit-Learn is More Suitable for Data Analysis | by Saptashwa Bhattacharyya | Mar, 2023

March 8, 2023


Pandas Compatibility and More in Scikit-Learn Version ≥1.2.0

Some pretty cool updates in the New Sklearn! (Source: Author’s Notebook)

Last December, Scikit-Learn released a major stable update (v1.2.0), and I finally got to try some of the highlighted new features. It is now more compatible with Pandas, and a few other additions help with both regression and classification tasks. Below, I go through some of the new updates with examples of how to use them. Let’s begin!

Compatibility with Pandas:

Standardizing data before training an ML model such as a regression or a neural net is a common technique to make sure features with different ranges get equal importance (if or when necessary) in predictions. Scikit-Learn provides various pre-processing APIs such as StandardScaler, MaxAbsScaler, etc. With the newer version, it is possible to keep the dataframe format even after the pre-processing. Let’s see below:

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
########################
X, y = load_wine(as_frame=True, return_X_y=True)
# as_frame is available from version >= 0.23
########################
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=0)
X_train.head(3)
Part of the Wine dataset in Dataframe format

The newer version includes an option to keep this dataframe format even after the standardization:


############
# v1.2.0
############

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().set_output(transform="pandas")  # change here

scaler.fit(X_train)
X_test_scaled = scaler.transform(X_test)
X_test_scaled.head(3)

Dataframe format is kept as it is even after standardization.

Before, it would have changed the format to a Numpy array:

###########
# v 0.24
###########

scaler = StandardScaler()
scaler.fit(X_train)
X_test_scaled = scaler.transform(X_test)
print (type(X_test_scaled))

>>> <class 'numpy.ndarray'>

With the dataframe format remaining intact, we don’t need to keep tabs on the column names, as we did with the Numpy array format. Analysis and plotting become easier:


import matplotlib.pyplot as plt

fig = plt.figure(figsize=(8, 5))
fig.add_subplot(121)
plt.scatter(X_test['proline'], X_test['hue'],
            c=X_test['alcohol'], alpha=0.8, cmap='bwr')
clb = plt.colorbar()
plt.xlabel('Proline', fontsize=11)
plt.ylabel('Hue', fontsize=11)
fig.add_subplot(122)
plt.scatter(X_test_scaled['proline'], X_test_scaled['hue'],
            c=X_test_scaled['alcohol'], alpha=0.8, cmap='bwr')
# pretty easy now in the newer version to see the effect

plt.xlabel('Proline (Standardized)', fontsize=11)
plt.ylabel('Hue (Standardized)', fontsize=11)
clb = plt.colorbar()
clb.ax.set_title('Alcohol', fontsize=8)
plt.tight_layout()
plt.show()

Fig. 1: Dependence of features before and after standardization! (Source: Author’s Notebook)

Even when we build a pipeline, each transformer in the pipeline can be configured to return dataframes as below:


from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

clf = make_pipeline(StandardScaler(), SVC())
clf.set_output(transform="pandas")  # change here
svm_fit = clf.fit(X_train, y_train)

print (clf[:-1])  # StandardScaler
print ('check that set_output format indeed remains even after we build a pipeline:', '\n')
X_test_transformed = clf[:-1].transform(X_test)

X_test_transformed.head(3)

Dataframe format can be kept as it is even within a pipeline!

Fetching Datasets is Faster and More Efficient:

OpenML is an open platform for sharing datasets, and the dataset API in Sklearn offers the fetch_openml function to fetch data; with the updated Sklearn, this step is more efficient in both memory and time.


import time
from sklearn.datasets import fetch_openml

start_t = time.time()
X, y = fetch_openml("titanic", version=1, as_frame=True,
                    return_X_y=True, parser="pandas")
# parser="pandas" is the addition in version 1.2.0

X = X.select_dtypes(["number", "category"]).drop(columns=["body"])
print ('check types: ', type(X), '\n', X.head(3))
print ('check shapes: ', X.shape)
end_t = time.time()
print ('time taken: ', end_t - start_t)

Using parser='pandas' makes a drastic improvement in runtime and memory consumption. One can check the memory consumption of the current process using the psutil library as:

import psutil
print (psutil.Process().memory_info().rss)  # resident memory in bytes

Partial Dependency Plots: Categorical Features

Partial dependency plots existed before too, but only for numerical features; now they have been extended to categorical features.

As described in the Sklearn documentation:

Partial dependence plots show the dependence between the targets and a set of input feature(s) of interest, marginalizing over the values of all other input features (the ‘complement’ features). Intuitively, we can interpret the partial dependence as the expected target response as a function of the input features of interest.

Using the ‘titanic’ dataset from above, we can easily plot the partial dependence of categorical features:

We then get partial dependency plots like the ones below:

Fig. 2: Partial dependency plots of categorical variables. (Source: Author’s Notebook)

With version 0.24, we would get a ValueError for categorical variables:

>>> ValueError: could not convert string to float: ‘female’

Directly Plot Residuals (Regression Models):

For analyzing the performance of a classification model, plotting routines like PrecisionRecallDisplay and RocCurveDisplay already existed within the Sklearn metrics API (version 0.24); the new update makes it possible to do something similar for regression models. Let’s see an example below:

Linear Model fit and corresponding residuals can be directly plotted using Sklearn. (Source: Author’s Notebook)

While it’s always possible to plot the fitted line and residuals using matplotlib or seaborn once we’ve settled on the best model, it’s great to be able to quickly check the results directly within the Sklearn environment.

There are a few more improvements and additions available in the new Sklearn, but I found these 4 major ones particularly useful for standard data analysis.

References:

[1] Sklearn Release Highlights: V 1.2.0

[2] Sklearn Release Highlights: Video

[3] All the plots and codes: My GitHub

If you’re interested in further fundamental machine learning concepts and more, you can consider joining Medium using My Link. You won’t pay anything extra but I’ll get a tiny commission. Appreciate you all!!


