Learnings from building a classifier
Imbalanced data is one of the most common scenarios in real-life classification problems. The predominant methods to handle such data are:
- Oversampling: duplicates minority-class samples to balance the two classes. The drawback is that it significantly increases training time and changes the inherent distribution of the real-life data we see at test time.
- Undersampling: removes samples from the majority class to balance the two classes. The drawback is that it discards data the model could have learned from.
- Weighted loss: assigns per-class weights to the cross-entropy loss so that the model is penalized more for mistakes on the minority class.
Example: suppose you have a dataset with 10,000 positive and 40,000 negative examples. If the loss weights are [1, 4] (negative, positive), the loss is amplified 4x whenever the model makes a mistake on the positive (minority) class.
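A minimal plain-Python sketch of that weighting (illustrative only, not a production loss): the per-example cross-entropy is scaled by the weight of the true class, so an equally confident mistake costs four times as much on the positive class.

```python
import math

def weighted_cross_entropy(logits, target, class_weights):
    """Cross-entropy for one example, scaled by the weight of its true class."""
    exps = [math.exp(z) for z in logits]
    p_true = exps[target] / sum(exps)          # softmax probability of the true class
    return -class_weights[target] * math.log(p_true)

weights = [1.0, 4.0]                           # [negative, positive]
# two equally confident mistakes, one on each class
loss_pos = weighted_cross_entropy([2.0, 0.0], 1, weights)  # misclassified positive
loss_neg = weighted_cross_entropy([0.0, 2.0], 0, weights)  # misclassified negative
# loss_pos is exactly 4x loss_neg
```

In PyTorch the same effect comes from the `weight` argument of `nn.CrossEntropyLoss`.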
Another variant of weighted loss is focal loss, introduced by Facebook AI Research to handle extreme class imbalance in dense object detection. One of its main selling points is its ability to distinguish between easy and hard samples, down-weighting the loss on examples the model already classifies confidently.
Read more at — https://amaarora.github.io/2020/06/29/FocalLoss.html
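A small illustrative sketch of the focal loss formula, FL(p_t) = -α(1 - p_t)^γ · log(p_t): the (1 - p_t)^γ factor shrinks the loss on easy examples (high p_t) far more than on hard ones.

```python
import math

def focal_loss(p_true, gamma=2.0, alpha=1.0):
    """Focal loss given the model's predicted probability for the true class."""
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)

easy = focal_loss(0.9)  # confident and correct: loss shrunk by (1 - 0.9)^2 = 0.01
hard = focal_loss(0.1)  # confident and wrong: loss stays close to plain cross-entropy
```

With gamma = 0 this reduces to ordinary (alpha-weighted) cross-entropy.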
Parameter-efficient fine-tuning (PET): there is a whole line of active research in this space.
In PET, we freeze the parameters of the pre-trained model and learn a small number of extra parameters that steer the model toward solving the designated task.
One of the prominent PET methods is prefix-tuning, where we prepend an extra set of learnable vectors to the keys (K) and values (V) of each transformer attention layer. All other parameters of the pre-trained model stay frozen.
This has several important advantages:
- We train only about 20% of the parameters, so training takes less time.
- We can keep a stack of prefix-tuned models (solving the same task or different tasks) while the base model stays the same.
- Since we do not tamper with the knowledge of the pre-trained model (remember catastrophic forgetting), it generalises better.
- Accuracy is almost equivalent to full fine-tuning.
Read more here: https://docs.adapterhub.ml/overview.html#prefix-tuning
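A toy numpy sketch of the mechanism (illustrative only, not the adapter-transformers API; all shapes and names are made up for the example): learnable prefix vectors are prepended to the keys and values of a single attention layer while the pre-trained projections stay frozen.

```python
import numpy as np

rng = np.random.default_rng(0)
d, seq, prefix_len = 8, 4, 2

# frozen pre-trained projection matrices (never updated during prefix-tuning)
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

# the ONLY learnable parameters: per-layer prefix keys and values
P_k = rng.normal(size=(prefix_len, d))
P_v = rng.normal(size=(prefix_len, d))

def attention_with_prefix(x):
    q = x @ W_q
    k = np.concatenate([P_k, x @ W_k], axis=0)   # prepend prefix keys
    v = np.concatenate([P_v, x @ W_v], axis=0)   # prepend prefix values
    scores = q @ k.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)           # softmax over prefix + tokens
    return w @ v

x = rng.normal(size=(seq, d))
out = attention_with_prefix(x)                   # shape (seq, d)
```

Gradients would flow only into `P_k` and `P_v`; the base model is untouched.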
There is a good library called ‘adapter-transformers’, built on top of Hugging Face, that focuses only on PET.
Generally, we place a linear layer on top of a pre-trained transformer that predicts the class logits from the “CLS” token.
There are more efficient ways to learn the logits; below are some of them.
Read more: Utilizing Transformer Representations Efficiently (Kaggle notebook)
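Two common alternatives, as a hedged numpy sketch (the `hidden_states` list stands in for a transformer's per-layer outputs, as returned with `output_hidden_states=True`; the shapes are invented): mean-pooling the last layer over all tokens, and concatenating the CLS vectors of the last four layers.

```python
import numpy as np

rng = np.random.default_rng(0)
num_layers, seq_len, hidden = 12, 16, 64

# stand-in for a transformer's per-layer hidden states
hidden_states = [rng.normal(size=(seq_len, hidden)) for _ in range(num_layers)]

# option 1: mean-pool the last layer over all tokens instead of only CLS
mean_pooled = hidden_states[-1].mean(axis=0)                      # shape (hidden,)

# option 2: concatenate the CLS (token 0) vectors of the last 4 layers
cls_concat = np.concatenate([h[0] for h in hidden_states[-4:]])   # shape (4 * hidden,)
```

Either vector would then feed the linear classification head in place of the single final-layer CLS embedding.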
Stabilisation of fine-tuning
Fine-tuning transformers can be unstable, and there is a risk of overfitting.
Some of the methods to avoid this are:
- Multi-sample dropout
- Layer-wise learning rate decay
- Grouped learning rate decay
- Pre-trained weight decay
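Layer-wise learning rate decay, for example, gives layers closer to the embeddings smaller learning rates than the top layers, on the intuition that lower layers hold more general knowledge. A minimal sketch (the function name and default values are my own):

```python
def layerwise_learning_rates(num_layers, base_lr=2e-5, decay=0.9):
    """Per-layer learning rates: layer 0 (nearest the embeddings) gets the
    smallest rate; the top layer gets base_lr."""
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]

lrs = layerwise_learning_rates(4)  # rates increase from bottom layer to top
```

Each rate would then be assigned to the matching parameter group in the optimizer.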
Averaging across models
When you have imbalanced data, creating a good training set can be challenging.
Models trained on different training data (a random sample from the full dataset) will make different mistakes and converge to different local optima.
To build a well-generalised model that utilises the full training data, follow these steps:
- Create a training set by sampling from the full dataset.
- Train the model for N epochs (chosen based on the data and your observations).
- Save the last checkpoint.
- Repeat this M times (I used M = 20).
Now you have M models, each converged to a different local optimum.
We can average the weights of all these models to create one final model.
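A minimal sketch of that weight averaging (checkpoints are represented here as plain dicts of parameter name to value; in practice they would be model state dicts):

```python
def average_weights(checkpoints):
    """Average parameter values across M checkpoints, key by key."""
    m = len(checkpoints)
    return {k: sum(ckpt[k] for ckpt in checkpoints) / m for k in checkpoints[0]}

# two toy "checkpoints" that converged to different optima
ckpts = [{"w": 1.0, "b": 0.0}, {"w": 3.0, "b": 2.0}]
final = average_weights(ckpts)  # {"w": 2.0, "b": 1.0}
```

The averaged parameters are loaded back into a single model, which is then used at test time.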