Abstract
This is a follow-up post on the authorship identification project.
I regard the past few years as the era of Transformers, which began with the influential paper “Attention Is All You Need” by Vaswani et al. in 2017. Several transformer architectures have appeared since then. Some of the famous ones are GPT, GPT-2, and the latest, GPT-3, which has outperformed many previous state-of-the-art models on several NLP tasks; BERT (by Google) is also one of the most popular transformers out there.
Transformers are very large models, often with hundreds of millions or even billions of parameters. Pretrained transformers have shown tremendous capability when paired with a downstream task head in transfer learning, much like pretrained CNNs in computer vision.
In this part, I’ll fine-tune a DistilBERT transformer, a smaller distilled version of the original BERT, for the downstream classification task.
I’ll use the transformers library from Hugging Face, which provides numerous state-of-the-art transformers and supports several downstream tasks out of the box. In short, I consider Hugging Face a great starting point for anyone interested in NLP, and it offers tons of useful functionality.
I’ll provide links to resources for you to learn more about these technologies.
{% raw %}
<div class="input_area">
import tensorflow as tf
import numpy as np
from pathlib import Path
from tensorflow import keras
from tensorflow.keras.utils import text_dataset_from_directory
from utils import plot_history

ds_dir = Path('data/C50/')
train_dir = ds_dir / 'train'
test_dir = ds_dir / 'test'

seed = 1000
batch_size = 16

# Hold out 20% of the training files for validation; using the same seed
# keeps the training and validation subsets disjoint.
train_ds = text_dataset_from_directory(train_dir,
                                       label_mode='int',
                                       seed=seed,
                                       shuffle=True,
                                       validation_split=0.2,
                                       subset='training',
                                       batch_size=batch_size)
val_ds = text_dataset_from_directory(train_dir,
                                     label_mode='int',
                                     seed=seed,
                                     shuffle=True,
                                     validation_split=0.2,
                                     subset='validation',
                                     batch_size=batch_size)
test_ds = text_dataset_from_directory(test_dir,
                                      label_mode='int',
                                      seed=seed,
                                      shuffle=True,
                                      batch_size=batch_size)
class_names = train_ds.class_names
</div>
{% endraw %}
{% raw %}
<div class="input_area">
from utils import prepare_batched
from transformers import DistilBertTokenizerFast

AUTOTUNE = tf.data.AUTOTUNE
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

# DistilBERT is memory-hungry, so fine-tune with a much smaller batch size.
batch_size = 2
train_ds = prepare_batched(train_ds, tokenizer, batch_size=batch_size)
val_ds = prepare_batched(val_ds, tokenizer, batch_size=batch_size)
test_ds = prepare_batched(test_ds, tokenizer, batch_size=batch_size)
</div>
{% endraw %}
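`prepare_batched` comes from the project’s own `utils` module, so its exact implementation isn’t shown here. The snippet below is a plausible sketch of what such a helper might do (the real one may differ): drain the raw text dataset, tokenize everything with the Hugging Face tokenizer, and rebuild a shuffled, batched `tf.data` pipeline of (encodings, label) pairs.

```python
import tensorflow as tf

def prepare_batched(ds, tokenizer, batch_size=16, max_length=512):
    """Sketch: tokenize a text dataset and return a batched tf.data pipeline."""
    texts, labels = [], []
    # Collect the raw strings and integer labels from the source dataset.
    for text_batch, label_batch in ds:
        texts.extend(t.decode('utf-8') for t in text_batch.numpy())
        labels.extend(label_batch.numpy())
    # Tokenize all texts at once; this yields input_ids and attention_mask.
    encodings = tokenizer(texts, truncation=True, padding='max_length',
                          max_length=max_length)
    return (tf.data.Dataset
            .from_tensor_slices((dict(encodings), labels))
            .shuffle(len(labels))
            .batch(batch_size)
            .prefetch(tf.data.AUTOTUNE))
```

The dict-of-tensors structure matters: Hugging Face TF models accept a dictionary with `input_ids` and `attention_mask` keys directly as model input, so `model.fit` can consume this dataset unchanged.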
{% raw %}
<div class="input_area">
from transformers import TFAutoModelForSequenceClassification

keras.backend.clear_session()
# Pretrained DistilBERT body with a freshly initialised 50-way classification head.
model = TFAutoModelForSequenceClassification.from_pretrained(
    'distilbert-base-uncased', num_labels=50)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
    # The model outputs raw logits, hence from_logits=True.
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()]
)
history = model.fit(train_ds, validation_data=val_ds, epochs=20)
plot_history(history, 'sparse_categorical_accuracy')
# save_pretrained is the recommended way to persist Hugging Face models.
model.save_pretrained("DistilBERT_finetuned")
</div>
{% endraw %}
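The `from_logits=True` argument in the compile step matters because Hugging Face’s TF classification heads return raw, unnormalised logits rather than probabilities. A quick sanity check (with made-up logits, not model output) shows that letting the loss apply the softmax internally is equivalent to applying it yourself:

```python
import tensorflow as tf

# A single made-up example: 3 class logits, true class is 0.
logits = tf.constant([[2.0, 0.5, -1.0]])
labels = tf.constant([0])

# Loss computed directly on logits (softmax applied inside the loss).
loss_from_logits = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True)(labels, logits)

# Loss computed on explicit probabilities (softmax applied by hand).
loss_from_probs = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=False)(labels, tf.nn.softmax(logits))
```

Passing probabilities to a `from_logits=True` loss (or vice versa) silently produces wrong gradients, which is a common source of stalled fine-tuning runs.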
{% raw %}
<div class="input_area">
print("Evaluate the model on test dataset")
model.evaluate(test_ds)
</div>
{% endraw %}