Reading the T5 paper was a pleasure and helped me learn a great deal.
The paper is written in a way that’s easy to understand and follow. Its long format (44 pages) allows the authors to explain things in detail.
Most importantly, its explicit focus is on ablation studies that shed clear light on what works and what doesn’t, pointing the way for future explorations.
A quick intro to T5
- The text-to-text framing enables T5 to solve both generation and classification problems with exactly the same encoder-decoder architecture, without needing different “heads” for different problems (as BERT did). This is a really cool and ambitious way to model the problem. To tell the model which task to perform, a prefix is added to the input signaling what’s expected in the output (e.g., “translate English to German: <input>”).
- In pretraining, the dropped-out tokens can be spans (multiple contiguous words), whereas BERT drops out single tokens.
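To make these two ideas concrete, here is a minimal sketch (my own toy code, not the paper’s). The `<extra_id_i>` sentinel names follow the common T5 tokenizer convention; the paper itself writes them as `<X>`, `<Y>`, `<Z>`:

```python
def to_text_to_text(task_prefix: str, text: str) -> str:
    """Cast any task as text-to-text by prepending a task prefix."""
    return f"{task_prefix}: {text}"

def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a sentinel token <extra_id_i>;
    the target lists each sentinel followed by the tokens it hid."""
    inp, tgt = [], []
    prev = 0
    for i, (start, end) in enumerate(spans):
        inp += tokens[prev:start] + [f"<extra_id_{i}>"]
        tgt += [f"<extra_id_{i}>"] + tokens[start:end]
        prev = end
    inp += tokens[prev:]
    tgt += [f"<extra_id_{len(spans)}>"]  # final sentinel ends the target
    return " ".join(inp), " ".join(tgt)

print(to_text_to_text("translate English to German", "That is good."))
# corrupt the spans "for inviting" and "last" (the paper's own example sentence):
tokens = "Thank you for inviting me to your party last week .".split()
inp, tgt = span_corrupt(tokens, [(2, 4), (8, 9)])
print(inp)  # Thank you <extra_id_0> me to your party <extra_id_1> week .
print(tgt)  # <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```

Note how the corrupted spans cover multiple contiguous tokens, and how the same seq2seq interface serves both the translation prefix and the denoising objective.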
Below is a non-exhaustive list of key ablation studies in the paper.
Ablation studies on pretraining
- No pretraining performs much worse than pretraining: this confirms the value of transfer learning. There’s one exception: machine translation on En-Fr.
- Using a denoising objective (gap filling) yields better results on downstream tasks than using a LM objective (next-token prediction).
- Within the denoising objective, the various ways of implementing gap filling yield similar performance. But two denoising approaches perform worse than the rest: deshuffling (shuffle the words in a sentence and predict the original sentence) and “drop tokens” (given a sentence with tokens dropped, predict the dropped tokens).
- My guess is that these two formulations are just too confusing/hard as general training objectives.
- Corruption rate (the % of tokens hidden for gap filling) and span length (max length of the hidden spans) don’t affect performance much as long as they’re within reasonable ranges (<= 5 tokens for masked spans; corruption rate <= 25%).
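As a quick illustration of how these two knobs interact, here is the bookkeeping arithmetic a span-corruption preprocessor needs (my own sketch of the math only, not the paper’s actual code; the default values match T5’s reported 15% corruption rate and mean span length of 3):

```python
def plan_corruption(num_tokens: int, corruption_rate: float = 0.15,
                    mean_span_length: float = 3.0):
    """Decide how many tokens to hide and how many spans to split
    them into, given a corruption rate and a mean span length."""
    num_noise = max(1, round(num_tokens * corruption_rate))
    num_spans = max(1, round(num_noise / mean_span_length))
    return num_noise, num_spans

# e.g. a 100-token sequence at the defaults: 15 hidden tokens in ~5 spans
print(plan_corruption(100))  # (15, 5)
```

The point of the ablation is that moving these numbers around (within reason) barely changes downstream performance.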
Ablation studies on pretraining data
- Performance drops if the pretraining dataset isn’t filtered properly.
- For example, one rule T5 used was to keep only text lines that end in a punctuation mark.
- Pre-training on in-domain data helps improve performance on downstream tasks.
- Regarding this result, the authors noted “This is unsurprising but also unsatisfying if our goal is to pre-train a model that can rapidly adapt to language tasks from arbitrary domains”.
- Pretraining for too many epochs (so the model repeatedly sees the same data) may degrade performance (in the T5 paper, performance started to degrade around 64 epochs).
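The punctuation rule mentioned above can be sketched in a few lines (my own toy version; the real C4 cleaning pipeline applies several more heuristics, such as language detection and deduplication):

```python
# Keep only lines that end in a terminal punctuation mark.
TERMINAL_PUNCTUATION = ('.', '!', '?', '"')

def keep_line(line: str) -> bool:
    """True if the line looks like prose rather than page boilerplate."""
    return line.rstrip().endswith(TERMINAL_PUNCTUATION)

page = [
    "T5 casts every NLP problem as text-to-text.",
    "Click here to subscribe",            # nav/boilerplate: dropped
    "Does transfer learning help? Yes!",
]
print([line for line in page if keep_line(line)])  # boilerplate is filtered out
```

Crude as it looks, a rule like this removes a lot of menu text, cookie banners, and navigation fragments from web-crawled data.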
Ablation studies on training strategy
- The paper mentioned two approaches for reducing the number of parameters to be updated during finetuning (FT):
- “Adapter layers”: additional feed-forward layers are inserted and finetuned, while all other parameters in the model are kept fixed. This significantly reduces the number of parameters updated during FT.
- “Gradual unfreezing”: all layers are initially frozen (not updated) and then gradually unfrozen, making more and more layers available for FT. This essentially still updates all parameters eventually, but speeds up finetuning.
The results showed that adapter layers performed much worse than standard FT, while gradual unfreezing slightly degraded performance (unsurprisingly!).
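To see why adapters shrink the finetuning cost so much, here is a rough parameter count (my own arithmetic with hypothetical sizes, not numbers from the paper). An adapter is a small bottleneck feed-forward block, added residually, and it is the only part that gets updated:

```python
def adapter_params(d_model: int, bottleneck: int) -> int:
    """Params of one adapter: down-projection (d -> r) and
    up-projection (r -> d) weights, plus their biases."""
    return d_model * bottleneck + bottleneck + bottleneck * d_model + d_model

d, r = 768, 64                # hypothetical model and bottleneck sizes
full_ff = 2 * d * (4 * d)     # weights of one standard transformer FF block
print(adapter_params(d, r))   # ~99K trainable params per adapter
print(full_ff)                # ~4.7M params in the frozen FF block alone
```

So each adapter trains roughly 2% of the parameters of even a single feed-forward block, which is exactly why it is attractive, and also why it can underperform full finetuning.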
Multi-task training and pre-training
- Multi-task training: instead of pretraining then finetuning, multi-task learning trains all tasks at once (so there is no separate pretraining or FT step) by mixing the training data, self-supervised or supervised (see more below), for all tasks. The ablation study showed that the multi-task learning approach underperformed the transfer learning approach.
- Multi-task pre-training + finetuning: since multi-task training didn’t yield good results, the authors experimented with adding a FT step after the multi-task training.
Multi-task training now becomes the pre-training step, with the following variations experimented for it. To clarify: after doing one of the following variations as pretraining, each task is finetuned on its own dataset.
- Unsupervised dataset + supervised datasets for all tasks
- As above, but leaving out the supervised dataset of each target task, to see if the general training can help the model adapt to an unseen task.
- Supervised multi-task datasets
Setting #1 yields performance comparable to the standard transfer learning pipeline. This is very much unsurprising because, in addition to the unsupervised dataset, it also includes the supervised datasets in pretraining. The fact that it didn’t outperform standard pretraining seems to indicate that the multi-task paradigm doesn’t mesh especially well with transfer learning.
Setting #2 degrades performance slightly compared to #1. That performance degraded just from leaving the target supervised dataset out of pretraining seems fairly intuitive. But this result may not have any significant implication, since the target dataset is still used in the FT step after this pretraining variation.
Setting #3, when followed by FT, yields results that are even worse than multi-task pretraining alone (with no FT). My takeaway is that FT doesn’t seem to combine well with pretraining on supervised datasets only. This is likely because those supervised datasets are not large enough to produce the effect of “general learning”, which massive datasets with self-supervised tasks can achieve.
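One practical detail behind all of these multi-task settings is how often to sample from each dataset when mixing them. A sketch of the temperature-scaled mixing strategy the paper studies (my own code for the formula, not T5’s): each dataset’s rate is proportional to `min(size, K) ** (1 / T)`, so a higher temperature `T` flattens the mix toward uniform and keeps huge datasets from drowning out small ones.

```python
def mixing_rates(sizes, K=2**21, T=1.0):
    """Sampling rate per dataset: proportional to min(size, K) ** (1/T)."""
    raw = [min(s, K) ** (1.0 / T) for s in sizes]
    total = sum(raw)
    return [r / total for r in raw]

sizes = [1_000_000, 10_000, 1_000]          # hypothetical dataset sizes
print([round(r, 3) for r in mixing_rates(sizes)])        # near-proportional
print([round(r, 3) for r in mixing_rates(sizes, T=8.0)]) # much flatter
```

At `T=1` the million-example dataset dominates almost completely; at `T=8` the three datasets are sampled at broadly comparable rates.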
Ablation studies on scaling
- Increasing the batch size vs. increasing the number of training steps: these can be complementary approaches for training the model further (but with different memory and time requirements).
- Doing both at the same time also worked.
- 4x model ≈ 2x model + 2x training time: this suggests that more training time can compensate for a smaller model (but this probably only holds when both the pretraining and FT datasets are really big).
- Ensembling is another orthogonal way to push the results a little further.
- Models that share one pretraining run but are finetuned separately are cheaper to obtain than completely separately trained models, yet they still make good ensembles.
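The cheap ensemble above is simple to picture: take checkpoints that share one pretraining run but were finetuned separately, and average their output distributions per example (a tiny sketch with hypothetical numbers, not anything from the paper):

```python
def ensemble_probs(model_probs):
    """Average class probabilities across models (uniform weighting)."""
    n = len(model_probs)
    return [sum(p[i] for p in model_probs) / n
            for i in range(len(model_probs[0]))]

# three finetuned checkpoints disagree on a 3-class example:
probs = [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1], [0.2, 0.7, 0.1]]
print(ensemble_probs(probs))  # averaged: class 1 now edges out class 0
```

In practice one would average logits or probabilities from the actual model outputs; the point is only that the per-model cost beyond the shared pretraining is just a finetuning run.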
In the “Reflection” section, the T5 authors provide very good points for further consideration. Below are the ones I found most interesting:
- In the ablation studies on model architecture (not covered above), the authors noticed that sharing parameters between the encoder and decoder degraded performance only a little. That seems to be a good way to halve the number of parameters without sacrificing much performance.
- This result is really intriguing, because the encoder and decoder appear to be doing different things. So why didn’t tying their parameters hurt performance more?
- In the bullet point “More efficient knowledge extraction”: my takeaway is that the authors seem to be signaling that we’ve reached the limit of pretraining via text denoising (i.e., masked-token prediction).
- This probably means we’ll see novel methods coming 😀
- In most (if not all) ablation studies in the paper, pretraining on English text didn’t achieve SOTA on translation tasks.
- This seems to be a (special) case where transfer learning doesn’t shine: when abundant labeled data is available.
P/s: If you’re interested in staying updated on new developments in Deep Learning, follow me on LinkedIn.