A Transformer model for inserting Vietnamese accent marks

Updates:  The trained model + instructions to use can now be downloaded from HF here.

In this post, I summarize how I made use of Huggingface’s transformer library to re-solve an NLP problem related to the Vietnamese language.

The problem

After learning about Hidden Markov models about 10+ years ago, I decided to apply it to building a small, but practical, toy that can auto insert accent marks for Vietnamese language.

In a nutshell, Vietnamese has some letters that have additional marks put on them. For ex, in addition to the letter ‘a’, the Vi alphabet also contains these “marked versions”: ă, â.

And for each of these 3  versions (a, ă, â), we can then put the 5 tones on them. An example for ‘ă’ will be:  ắ (acute),  ằ (grave), ẳ (hook), ẵ (tilde), ặ (dot).

Continue reading A Transformer model for inserting Vietnamese accent marks