Finding the Optimal Number of Epochs Using K-Fold Cross-Validation for Transformer Models

Fine-tuning a pre-trained Transformer model on a downstream task such as text classification can easily overfit when the dataset is small. In such scenarios we can leverage the K-fold technique to prevent overtraining and get better generalization from the model.

Shreya Goyal
Geek Culture


Understanding K-Fold:

As the name suggests, we simply split our data into K equal partitions and use a different partition each time as the validation/test set, while training the model on the remaining K-1 partitions.

Figure 1: K-Folds for Training, Image by author
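As a quick illustration of the splitting itself, here is a minimal sketch using scikit-learn's plain KFold on ten dummy samples; each fold holds out a different fifth of the indices for validation, just like the partitions in Figure 1.

from sklearn.model_selection import KFold
import numpy as np

X = np.arange(10)  # ten dummy samples, only the indices matter here

kf = KFold(n_splits=5, shuffle=True, random_state=20)
for fold, (train_idx, valid_idx) in enumerate(kf.split(X), start=1):
    # each fold trains on 8 samples and validates on the remaining 2
    print(f"Fold {fold}: train={train_idx}, valid={valid_idx}")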

I am using the TensorFlow framework to illustrate it in code, but the same concept can be extended to PyTorch as well.

We will be using the Hugging Face Transformers library for the pre-trained Transformer model.

Step 1: Split the data into the required K folds, e.g., 5–7 folds. Don’t use a large value for K, as it will increase the training time.

from sklearn.model_selection import StratifiedKFold

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=20)

for train_indices, valid_indices in kfold.split(df, df['label']):
    train_df = df.iloc[train_indices]
    valid_df = df.iloc[valid_indices]

Using stratified folds makes sure the labels appear in each fold in the same ratio as in the original data, so every class is evenly represented across the folds.
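To see what stratification buys us, here is a small sketch on a made-up, imbalanced toy DataFrame (the column names and 75/25 class ratio are illustrative only); the label ratio inside each validation fold stays close to the ratio in the full data:

import pandas as pd
from sklearn.model_selection import StratifiedKFold

# toy DataFrame with an imbalanced label column (illustrative numbers)
toy_df = pd.DataFrame({
    "text": [f"sample {i}" for i in range(20)],
    "label": [0] * 15 + [1] * 5,   # 75% class 0, 25% class 1
})

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=20)
for fold, (_, valid_idx) in enumerate(skf.split(toy_df, toy_df["label"]), start=1):
    ratio = toy_df.iloc[valid_idx]["label"].mean()
    print(f"Fold {fold}: share of class 1 in the validation fold = {ratio:.2f}")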

Step 2: Tokenize the data and train the model on each split.

from transformers import (AutoTokenizer, DataCollatorWithPadding,
                          TFAutoModelForSequenceClassification)
from tensorflow.keras import optimizers
from tensorflow.keras import metrics
from tensorflow.keras import losses
from datasets import Dataset


def tokenize(x, tokenizer):
    return tokenizer(x['inputs'], padding=True, truncation=True)


def create_dataset(df):
    x, y = df['text'], df['label']
    data_dict = {"inputs": x, "labels": y}

    tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
    # you can use any pretrained model as needed

    data_ds = Dataset.from_dict(data_dict)
    tokenized_data = data_ds.map(lambda x: tokenize(x, tokenizer), batched=True)
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer,
                                            return_tensors="tf")
    tf_dataset = tokenized_data.to_tf_dataset(
        columns=['input_ids', 'attention_mask'],
        label_cols=["labels"],
        shuffle=True,
        batch_size=8,
        collate_fn=data_collator
    )
    return tf_dataset


def build_model(num_labels):
    model = TFAutoModelForSequenceClassification.from_pretrained('bert-base-uncased',
                                                                 num_labels=num_labels)
    model.compile(
        optimizer=optimizers.Adam(learning_rate=5e-5),
        loss=losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=[metrics.SparseCategoricalAccuracy()]
    )
    return model

Instead of training each split for a fixed number of epochs, we can use the EarlyStopping callback. For details, you can refer to the Keras EarlyStopping documentation.

import tensorflow as tf

def train_model(train_data, num_labels, valid_data=None, epochs_num=20):
    model = build_model(num_labels=num_labels)
    model_callback = None

    if valid_data is not None:
        early_callback = tf.keras.callbacks.EarlyStopping(
            monitor='val_sparse_categorical_accuracy', patience=7,
            restore_best_weights=True, mode='max', min_delta=0.001)
        model_callback = [early_callback]

    # the datasets from create_dataset are already batched,
    # so batch_size is not passed to model.fit
    history = model.fit(train_data,
                        validation_data=valid_data,
                        epochs=epochs_num, verbose=2,
                        callbacks=model_callback)
    return history, model

Now we iterate over each split and train until the validation accuracy stops improving, which the early-stopping callback takes care of.


all_val_accuracies = []

for train_indices, valid_indices in kfold.split(df, df['label']):
    train_df = df.iloc[train_indices]
    valid_df = df.iloc[valid_indices]

    train_tf_data = create_dataset(train_df)
    valid_tf_data = create_dataset(valid_df)

    history, model = train_model(train_tf_data, num_labels, valid_tf_data)

    val_accuracy = history.history['val_sparse_categorical_accuracy']
    # a list with the validation accuracy of each epoch

    all_val_accuracies.append(val_accuracy)

Step 3: Calculate the average accuracy for each epoch over all K validation splits and find the epoch with the maximum accuracy.

Let’s assume the first model trained for 15 epochs, the second for 12 epochs, and so on (early stopping ends each fold at a different epoch). So we first trim every accuracy list to the smallest number of epochs and then convert the result into a 2-D NumPy array.

import numpy as np

epochs_len = [len(each_hist) for each_hist in all_val_accuracies]
min_epoch = min(epochs_len)

trimmed_val_accuracies = np.array([each_hist[:min_epoch]
                                   for each_hist in all_val_accuracies])

Now we need to calculate the average accuracy for each epoch. To do this, we sum the array along axis=0 and divide by K. Since K is constant, we can simply take the epoch with the maximum sum.

sum_val_acc = np.sum(trimmed_val_accuracies, axis=0)

optimal_epoch = np.argmax(sum_val_acc) + 1
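For intuition, here is the same computation on made-up validation curves from three hypothetical folds that stopped at different epochs (the numbers are illustrative only):

import numpy as np

# hypothetical per-epoch validation accuracies for 3 folds
toy_val_accuracies = [
    [0.70, 0.78, 0.83, 0.82, 0.81],        # fold stopped after 5 epochs
    [0.68, 0.80, 0.84, 0.85],              # fold stopped after 4 epochs
    [0.72, 0.79, 0.81, 0.86, 0.84, 0.83],  # fold stopped after 6 epochs
]

min_epoch = min(len(h) for h in toy_val_accuracies)            # 4
trimmed = np.array([h[:min_epoch] for h in toy_val_accuracies])

# epoch with the highest summed (equivalently, mean) validation accuracy
optimal = np.argmax(trimmed.sum(axis=0)) + 1
print(optimal)  # -> 4 for these numbers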

Step 4: Retrain the model on the complete data for the optimal number of epochs.

tf_data = create_dataset(df)

history, model = train_model(tf_data, num_labels,
                             epochs_num=optimal_epoch)

model.save_pretrained("optimal-epoch-model", saved_model=True)
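The saved model can later be reloaded with the usual from_pretrained call:

from transformers import TFAutoModelForSequenceClassification

reloaded_model = TFAutoModelForSequenceClassification.from_pretrained("optimal-epoch-model")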

Using this method does add a considerable amount of training time, and you can often get away with simply splitting the data into one training and one validation set and using the early-stopping callback; this works most of the time, provided there is enough validation data.
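For reference, that simpler setup could look roughly like the sketch below, reusing the helper functions defined above with a single stratified split (assuming the same df and num_labels as before):

from sklearn.model_selection import train_test_split

# one stratified 80/20 split instead of K folds
train_df, valid_df = train_test_split(df, test_size=0.2,
                                      stratify=df['label'], random_state=20)

train_tf_data = create_dataset(train_df)
valid_tf_data = create_dataset(valid_df)

# EarlyStopping inside train_model stops once validation accuracy plateaus
history, model = train_model(train_tf_data, num_labels, valid_tf_data)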

But for very small datasets, where you need all of the data for training and cannot keep data points aside as a fixed validation set, the K-fold technique can be used to tune hyperparameters such as the number of epochs.
