Reading the T5 paper was a pleasure and taught me a great deal.
The paper is written in a way that’s easy to understand and follow. Its long-form style (44 pages) gives the authors room to explain things in detail.
Most importantly, its explicit focus is on ablation studies that shed clear light on what works and what doesn’t, pointing the way for future exploration.
A quick intro to T5
- The text-to-text framing lets it solve both generation and classification problems with the exact same encoder-decoder architecture, without needing different “heads” for different problems (as was done with BERT). This is a really cool and ambitious way of modelling the problem. To tell the model which task to perform, a prefix is added to the input that signals what’s expected in the output (e.g. “translate English to German: <input>”); see the first sketch after this list.
- In pretraining, the dropped-out tokens can be spans (multiple contiguous tokens), whereas BERT drops out single tokens; see the second sketch after this list.
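To make the first point concrete, here is a minimal sketch of the text-to-text interface, assuming the Hugging Face transformers library and its public t5-small checkpoint (my own illustration, not the codebase used in the paper). The only task-specific part is the prefix at the start of the input string; the model and weights are identical for every task.

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# The task is selected purely by the text prefix; both a generation task
# (translation) and a classification task (CoLA acceptability) go through
# the same encoder-decoder and produce their answer as text.
for prompt in [
    "translate English to German: The house is wonderful.",
    "cola sentence: The course is jumping well.",
]:
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_length=40)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```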
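And to illustrate the second point, below is a toy Python sketch of span corruption (my own simplification, not the paper's implementation): contiguous spans are replaced in the input by sentinel tokens, and the target spells out the dropped spans, each introduced by its sentinel. The number of spans, their length, and the corruption rate are fixed here for readability; the paper samples them to hit a target corruption rate (15% by default).

```python
import random

def span_corrupt(tokens, num_spans=2, span_len=2, seed=0):
    """Toy T5-style span corruption: replace a few non-overlapping
    contiguous spans with sentinels; the target lists the dropped spans."""
    rng = random.Random(seed)
    tokens = list(tokens)

    # Pick non-overlapping span start positions (simplified: a fixed
    # number of fixed-length spans at random positions).
    candidates = list(range(len(tokens) - span_len + 1))
    rng.shuffle(candidates)
    starts = []
    for s in candidates:
        if all(abs(s - t) >= span_len for t in starts):
            starts.append(s)
        if len(starts) == num_spans:
            break
    starts.sort()

    inputs, targets, last = [], [], 0
    for i, start in enumerate(starts):
        sentinel = f"<extra_id_{i}>"          # T5's sentinel tokens (<X>, <Y>, ... in the paper's figure)
        inputs += tokens[last:start] + [sentinel]
        targets += [sentinel] + tokens[start:start + span_len]
        last = start + span_len
    inputs += tokens[last:]
    targets += [f"<extra_id_{len(starts)}>"]  # final sentinel marks the end of the target
    return " ".join(inputs), " ".join(targets)

inp, tgt = span_corrupt("Thank you for inviting me to your party last week .".split())
print(inp)  # corrupted input with sentinel placeholders
print(tgt)  # target: the dropped spans, delimited by sentinels
```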
Below is a non-exhaustive list of key ablation studies in the paper.