ALBERT – A Lite BERT

Every researcher or NLP (Natural Language Processing) practitioner is well aware of BERT (Bidirectional Encoder Representations from Transformers), which appeared in 2018. Since then, the NLP field has been transformed to a much greater extent. ALBERT, short for ‘A Lite BERT’, was designed to make BERT as light as possible by reducing its parameter count. A key advantage of deep learning for a task such as sentiment analysis is that data preprocessing is greatly reduced: often the only preprocessing required is converting the text to lower case. If classical machine learning methods such as logistic regression with TF-IDF are used instead, unnecessary words also need to be removed.

Transformer models, especially BERT, transformed the NLP pipeline. They eased the problem of sparse annotations for text data: instead of training a model from scratch, one can simply fine-tune an existing pre-trained model. But the sheer size of BERT makes it somewhat unapproachable; it is compute-intensive and time-consuming to run inference with. ALBERT is a lite version of BERT that downsizes the model while largely maintaining its performance. Researchers at Google Research and the Toyota Technological Institute at Chicago published the model in a paper presented at ICLR 2020.
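
For example, a pre-trained ALBERT checkpoint can be loaded and used for inference in a few lines. This is a minimal sketch assuming the Hugging Face transformers library (with PyTorch) and the publicly released albert-base-v2 checkpoint; the classification head and the example sentence are placeholders that would be fine-tuned and replaced in a real application.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("albert-base-v2")
# The classification head is randomly initialized here; in practice it would
# be fine-tuned on labeled data (e.g. for binary sentiment analysis).
model = AutoModelForSequenceClassification.from_pretrained("albert-base-v2",
                                                           num_labels=2)

inputs = tokenizer("The movie was surprisingly good.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.shape)   # torch.Size([1, 2])
```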

ALBERT – The Architecture

ALBERT, like BERT, uses only the encoder side of the Transformer: a multi-layer stack in which each layer applies multi-headed self-attention to the input representations. The backbone of the architecture is therefore the multi-headed, multi-layer Transformer encoder.

The main ‘mission’ of ALBERT is to reduce the number of parameters (up to roughly 90% fewer than a comparable BERT) using novel techniques without taking a big hit to performance. This compressed version scales much better than the original BERT, improving performance while keeping the model small. The model consists of several blocks stacked on top of one another; each block contains a multi-head attention sublayer and a feed-forward network. There are few changes to this basic architecture in ALBERT; instead, the following techniques are used to achieve the compression.

  1. Factorization of embedding parameters
    Hidden-layer representations need to be large to capture contextual information on top of the word-level embedding information. However, increasing the hidden size also blows up the number of parameters, because in BERT the embedding size is tied to the hidden size. ALBERT instead factorizes the word-level input embeddings into a lower-dimensional space and then projects them up to the hidden size (see the PyTorch sketch after this list).
  2. Cross-layer parameter sharing
    Although stacking independent layers increases the learning capacity of the model, it also introduces a great deal of redundancy. ALBERT deals with this redundancy by sharing parameters between groups of layers (in the default configuration, all layers share the same weights). This reduces the total number of parameters while keeping the number of layers constant (also illustrated in the sketch after this list).
  3. Inter-sentence coherence loss
    This loss is used to improve the quality of the learned representations on downstream tasks. BERT is pre-trained with a next sentence prediction (NSP) objective; ALBERT replaces NSP with sentence-order prediction (SOP), in which the model must decide whether two consecutive segments appear in their original order or have been swapped (see the SOP sketch after this list).

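To make the first two techniques concrete, here is an illustrative PyTorch sketch (an assumption about how the ideas could be implemented, not the official ALBERT code). The sizes roughly follow the albert-base configuration: vocabulary size 30,000, embedding size 128, hidden size 768 and 12 layers.

```python
import torch
import torch.nn as nn

V, E, H, num_layers = 30000, 128, 768, 12

# 1. Factorized embedding parameterization:
#    a single V x H embedding matrix (~23.0M parameters) is replaced by
#    V x E + E x H (~3.9M parameters for these sizes).
word_embeddings = nn.Embedding(V, E)       # V * E = 3.84M parameters
embedding_projection = nn.Linear(E, H)     # E * H ≈ 0.10M parameters

# 2. Cross-layer parameter sharing: one encoder layer is reused at every
#    depth, so its parameters are counted once instead of num_layers times.
#    (PyTorch's default feed-forward size differs from ALBERT's, but the
#    sharing idea is the same.)
shared_layer = nn.TransformerEncoderLayer(d_model=H, nhead=12, batch_first=True)

def encode(token_ids):
    hidden = embedding_projection(word_embeddings(token_ids))
    for _ in range(num_layers):            # the same weights applied 12 times
        hidden = shared_layer(hidden)
    return hidden

tokens = torch.randint(0, V, (1, 16))      # a dummy batch of 16 token ids
print(encode(tokens).shape)                # torch.Size([1, 16, 768])
```
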
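The third technique, the sentence-order prediction (SOP) objective behind the inter-sentence coherence loss, can be sketched as follows; the function name and example segments are hypothetical and only meant to show how positive and negative pairs are formed.

```python
import random

def make_sop_example(segment_a, segment_b):
    """Return (first, second, label), where label 1 means the two segments
    are in their original order and 0 means they have been swapped."""
    if random.random() < 0.5:
        return segment_a, segment_b, 1     # positive: original order
    return segment_b, segment_a, 0         # negative: swapped order

print(make_sop_example("ALBERT shares parameters across layers.",
                       "This keeps the model small."))
```
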
ALBERT is a very useful, highly compact variant of BERT. It can improve performance on downstream language understanding tasks while keeping the computational overhead at an acceptable level for many applications.
