Machine learning is the field of study that enables computer systems to learn from experience without being explicitly programmed. In the process of creating a machine learning model, one of the most important tasks is data preparation.
Splitting the data into the right sets for training, validation, and testing is crucial to ensure that the model generalizes well to new data. In this article, we will discuss the importance of the train/dev/test sets and their role in machine learning model development.
The training set is the data used to train the machine learning model. It is the set of input data and output values used by the algorithm to learn the relationships between the features and the target variable.
Typically, the training set makes up around 70–80% of the total available data. It must be diverse and representative, and must contain enough samples for the algorithm to learn the underlying patterns.
The algorithm uses optimization techniques such as gradient descent (with backpropagation computing the gradients in the case of neural networks) to adjust the model’s parameters to fit the data.
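To make the idea of fitting parameters concrete, here is a minimal, illustrative sketch of gradient descent on a linear-regression problem. The dataset, learning rate, and iteration count are all assumptions chosen for the example, not part of the article:

```python
import numpy as np

# Illustrative sketch: fit linear-regression weights by gradient descent.
rng = np.random.default_rng(0)
X = rng.random((100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.standard_normal(100)

w = np.zeros(3)   # model parameters, learned from the data
lr = 0.1          # learning rate, a hyperparameter set before training
for _ in range(2000):
    grad = 2 * X.T @ (X @ w - y) / len(y)  # gradient of the mean squared error
    w -= lr * grad                          # parameter update step

print(w)  # recovered weights, close to true_w
```

Note the distinction this sketch makes visible: `w` is a parameter learned during training, while `lr` is a hyperparameter fixed beforehand, which is exactly what the dev set (next section) is used to tune.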
The dev set, also known as the validation set, is a subset of the data used to tune the model’s hyperparameters. Hyperparameters are parameters that are not learned from the data but are set before the training process begins.
Examples of hyperparameters include the learning rate, regularization strength, and the number of hidden units in a neural network. The dev set is typically made up of around 10–15% of the total available data, and it should be representative of the overall distribution of the data.
The dev set can be used to determine the optimal hyperparameters that maximize the model’s performance.
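As a sketch of this tuning loop, the snippet below tries several candidate values of a regularization hyperparameter and keeps whichever scores best on the dev set. The synthetic data, the choice of logistic regression, and the candidate values for `C` are all assumptions made for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic dataset for illustration only
rng = np.random.default_rng(42)
X = rng.random((1000, 10))
y = (X[:, 0] + X[:, 1] > 1).astype(int)

X_train, X_dev, y_train, y_dev = train_test_split(X, y, test_size=0.2, random_state=42)

# Try several regularization strengths; evaluate each on the dev set only
best_C, best_acc = None, -1.0
for C in [0.01, 0.1, 1.0, 10.0]:
    model = LogisticRegression(C=C).fit(X_train, y_train)
    acc = model.score(X_dev, y_dev)
    if acc > best_acc:
        best_C, best_acc = C, acc

print(f"Best C: {best_C}, dev accuracy: {best_acc:.3f}")
```

The key point is that the dev set is used only for this comparison; once a value of `C` is chosen, the final evaluation still belongs to the test set.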
The test set is used to evaluate the performance of the machine learning model after it has been trained and tuned using the training and dev sets. It is a held-out set of data that the model has never seen during training or hyperparameter tuning.
It is used to simulate the model’s performance on new, real-world data. Typically, the test set makes up around 10–20% of the total available data. The test set should be representative of the overall distribution of the data and contain examples of all the possible outcomes the model might encounter.
There are several techniques for splitting data into train/dev/test sets. One of the most common is the holdout method, where the available data is partitioned once into fixed sets: one for training and one for testing (and, when hyperparameters need tuning, a third for validation).
Example:
from sklearn.model_selection import train_test_split
import numpy as np

# Generate synthetic dataset
X = np.random.rand(1000, 10)
y = np.random.randint(0, 2, size=1000)
# First split off 20% of the data for the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Then split 25% of the remaining 80% for the dev set (i.e. 20% of the total)
X_train, X_dev, y_train, y_dev = train_test_split(X_train, y_train, test_size=0.25, random_state=42)
# Print the shapes of the train, dev and test sets
print("Shape of X_train: ", X_train.shape)
print("Shape of y_train: ", y_train.shape)
print("Shape of X_dev: ", X_dev.shape)
print("Shape of y_dev: ", y_dev.shape)
print("Shape of X_test: ", X_test.shape)
print("Shape of y_test: ", y_test.shape)
Another popular method is k-fold cross-validation, where the data is divided into k equally sized subsets, and the model is trained and evaluated k times. Each time, a different subset is held out for evaluation, and the remaining k−1 subsets are used for training.
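The k-fold procedure can be sketched with scikit-learn’s built-in utilities. The synthetic dataset and the choice of logistic regression as the model are assumptions for the example:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic dataset for illustration only
rng = np.random.default_rng(0)
X = rng.random((200, 5))
y = (X[:, 0] > 0.5).astype(int)

# 5-fold cross-validation: each fold takes a turn as the held-out set
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(), X, y, cv=kf)

print(scores)         # one accuracy score per fold
print(scores.mean())  # average accuracy across the 5 folds
```

Averaging over the folds gives a more stable performance estimate than a single holdout split, at the cost of training the model k times.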
In conclusion, the train/dev/test sets are essential for developing machine learning models that generalize well to new data. The training set is used to teach the algorithm the underlying patterns in the data, the dev set is used to optimize the model’s hyperparameters, and the test set is used to evaluate the model’s performance on new data.