Auto insert Vietnamese accent marks

A Transformer model for inserting Vietnamese accent marks

Huggingface’s Transformers library is enabling engineers and developers to access the latest developments in AI research. Kudos to them.

Below, I summarize how I made use of their library to re-solve an NLP problem related to the Vietnamese language.

The problem

After learning about Hidden Markov Models about 10+ years ago, I decided to apply them to building a small but practical toy that can auto-insert accent marks for the Vietnamese language.

In a nutshell, Vietnamese has some letters that have additional marks put on them. For example, in addition to the letter ‘a’, the Vietnamese alphabet also contains these “marked versions”: ă, â.

And for each of these 3 versions (a, ă, â), we can then put the 5 tones on them. An example for ‘ă’ would be: ắ (acute), ằ (grave), ẳ (hook), ẵ (tilde), ặ (dot).

(More info about the Vietnamese alphabet can be found here. Also, if you know French, this phenomenon is similar, except that in Vietnamese the accents are used much more frequently.)

As an example, from the original letter “a”, we’d have a total of 3 (“marked versions”) x 6 (5 tones + no tone) = 18 possible forms.

(For our purposes, inserting “accent marks” refers to both producing the marked versions and putting tones on them to produce the final, correct Vietnamese words.)

The problem can now be stated as: given a Vietnamese sentence or paragraph whose accent marks have been stripped away, recover the original sentence.
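To make the “stripping” concrete, here is a minimal Python sketch (my own illustration, not the code behind the site) of how accent marks can be removed from normal Vietnamese text; the same trick also lets us generate input/label pairs from any natural text:

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Remove Vietnamese accent marks, e.g. "Tôi đi học" -> "Toi di hoc"."""
    # NFD decomposition splits an accented letter into its base letter plus
    # combining marks; dropping the marks (category "Mn") removes both the
    # "marked versions" and the tones.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # "đ"/"Đ" are standalone letters rather than letter + combining mark,
    # so they are mapped explicitly.
    return stripped.replace("đ", "d").replace("Đ", "D")

print(strip_accents("Tôi đi học"))  # -> "Toi di hoc"
```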

This is a practical problem: some people (especially Vietnamese learners) want to type the accent-free version and have the software auto-insert the accent marks for them. Typing without accents is faster, and it is especially helpful when the Vietnamese typing software (for typing the accented versions) hasn’t been installed on the computer.

The original HMM solution

The original solution, which has been running on VietnameseAccent.com, models the task as finding the hidden labels (accented forms) of the no-accent words, assuming the Markov property (i.e., using bigrams).

The Viterbi algorithm is then used to find the optimal path among all possible ones.
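For readers unfamiliar with it, below is a minimal sketch of such a bigram Viterbi decoder. The inputs (candidates, start_logprob, bigram_logprob) are illustrative stand-ins for the statistics the real model estimates from its corpus; this is not the actual implementation behind VietnameseAccent.com.

```python
def viterbi_accent(words, candidates, start_logprob, bigram_logprob):
    """Pick the most likely accented form for each no-accent word by scoring
    candidate sequences with bigram log-probabilities (Viterbi decoding)."""
    UNSEEN = -1e9  # score for bigrams/candidates never seen in training

    # Each layer maps a candidate accented word -> (best path score, backpointer).
    first = candidates.get(words[0], [words[0]])
    layers = [{c: (start_logprob.get(c, UNSEEN), None) for c in first}]

    for word in words[1:]:
        layer = {}
        for cand in candidates.get(word, [word]):
            # Choose the best previous candidate to transition from.
            score, prev = max(
                (prev_score + bigram_logprob.get((prev_cand, cand), UNSEEN), prev_cand)
                for prev_cand, (prev_score, _) in layers[-1].items()
            )
            layer[cand] = (score, prev)
        layers.append(layer)

    # Backtrack from the best final candidate to recover the full path.
    best = max(layers[-1], key=lambda c: layers[-1][c][0])
    path = [best]
    for layer in reversed(layers[1:]):
        path.append(layer[path[-1]][1])
    return list(reversed(path))

# Toy usage: viterbi_accent("toi muon an".split(), candidates, start_lp, bigram_lp)
# would return something like ["tôi", "muốn", "ăn"] given suitable statistics.
```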

When this was developed, I didn’t formally pin its accuracy down to a number, although I did find it “quite accurate” through manual testing.

But to help compare this version with the Transformer version I plan to build, this time I decided to evaluate its accuracy using a test set of about 15k sentences.

And its accuracy is surprisingly decent: min = 0.0, max = 1.0, mean = 0.91, median = 0.94

The 2021 Transformer version

This problem of auto-inserting accent marks fits nicely into a token classification problem (similar to, for example, the classical problems of POS tagging and named-entity recognition in English).

In particular, each no-accent word is a “token” and all the possible accented versions of this word can be seen as its “labels”. The task is to select the correct label of this word in a given sentence context.

Using Huggingface’s Transformers lib, we can then take a BERT-like model pretrained on Vietnamese text and fine-tune it for our purpose.
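In code, the setup might look roughly like the snippet below. The label list is purely illustrative (in practice it would be built from all accented forms observed in the corpus), and the checkpoint name matches the XLM-RoBERTa Large model mentioned at the end of this post:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Illustrative label set: every accented form seen in the training data would be
# a label; only a handful are shown here. A word with no accented variants simply
# has itself as its only label.
labels = ["toi", "tôi", "tối", "tỏi", "muốn", "ăn", "học"]
id2label = dict(enumerate(labels))
label2id = {label: i for i, label in enumerate(labels)}

model_name = "xlm-roberta-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=len(labels),
    id2label=id2label,
    label2id=label2id,
)
```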

The fine-tuning process is the standard one provided by the library, so I’ll only mention its specifics (the model and data used) later.
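For the curious, a sketch of that standard recipe is below, continuing from the snippet above. Here train_dataset and val_dataset are assumed to be already tokenized with per-token label ids (using -100 for special and continuation sub-word tokens), and the hyperparameters are just the library defaults, not anything tuned:

```python
import numpy as np
from datasets import load_metric
from transformers import DataCollatorForTokenClassification, Trainer, TrainingArguments

seqeval = load_metric("seqeval")

def compute_metrics(eval_pred):
    """Map token-level predictions back to label strings and score with seqeval."""
    logits, label_ids = eval_pred
    preds = np.argmax(logits, axis=-1)
    # Ignore positions labeled -100 (special tokens, sub-word continuations).
    true_labels = [[id2label[l] for l in row if l != -100] for row in label_ids]
    true_preds = [
        [id2label[p] for p, l in zip(p_row, l_row) if l != -100]
        for p_row, l_row in zip(preds, label_ids)
    ]
    results = seqeval.compute(predictions=true_preds, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }

args = TrainingArguments(
    output_dir="vi-accent-restorer",
    num_train_epochs=3,
    evaluation_strategy="epoch",
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,   # assumed prepared beforehand
    eval_dataset=val_dataset,
    data_collator=DataCollatorForTokenClassification(tokenizer),
    compute_metrics=compute_metrics,
)
trainer.train()
```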

For now, let’s go through the training results, because this is where many interesting things were learned.

Fine-tuning Round 1

Training size     Validation size    Test size
~15k sentences    ~15k sentences     ~15k sentences

(The available training data were much larger, but at that point I wasn’t aware that Google Colab supports a high-RAM runtime, so ~15k sentences were all it could take before crashing.)

Let’s keep the default values for all parameters and start training. After 20 mins of training, below are the results produced by Seqeval.

Epoch Training Loss Validation Loss Precision Recall F1 Accuracy
1 2.134300 0.735763 0.697687 0.701565 0.699621 0.773819
2 0.717500 0.580149 0.759825 0.763575 0.761695 0.821985
3 0.568200 0.528949 0.782944 0.786753 0.784844 0.840678

And here’s the result on the test data set:

overall_precision overall_recall overall_f1 overall_accuracy
0.784707 0.788418 0.786558 0.842166

The performance on the test set was on par with the validation results, so everything seemed all right.

The only thing that wasn’t right is that this accuracy is still a lot lower than that of the HMM version (91% average accuracy)!

Anyway, I thought that was a fairly good result, given that this was my first attempt at this whole fine-tuning thing. So I proceeded to evaluate this model the same way I evaluated the original model.

(To make things clearer, the above results were produced by seqeval in the context of the token classification task. The approach used for evaluating the original version was a more “user-friendly” one: given a sentence of n words, what fraction of the words are accented correctly, where words are split on spaces and punctuation is removed.)
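A rough sketch of that word-level metric (my reconstruction of the description above, not the exact evaluation script) could look like this:

```python
import string

def sentence_accuracy(predicted: str, reference: str) -> float:
    """Fraction of words whose accent marks were restored correctly.
    Words are split on spaces; surrounding punctuation is stripped first."""
    def clean(word):
        return word.strip(string.punctuation)

    pred_words = [clean(w) for w in predicted.split()]
    ref_words = [clean(w) for w in reference.split()]
    correct = sum(p == r for p, r in zip(pred_words, ref_words))
    return correct / max(len(ref_words), 1)

# The min / max / mean / median figures reported in this post are then computed
# over these per-sentence scores across the whole test set.
```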

When measured using the same method as for the original version, this is the result: min = 0.0, max = 1.0, mean = 0.78, median = 0.8

Oops. So it’s not 84%, but only 80%. (A question for readers: Why this (small) difference?)

Fine-tuning Round 2

In Colab high-RAM mode, I managed to get the training data up to 16x the previous round, so it’s now about 240k training sentences. The validation & test data sets remained the same.

Training now took about 4 hours (instead of 20 minutes in the previous round).

Below are the results on the validation and test sets:

Epoch Training Loss Validation Loss Precision Recall F1 Accuracy
1 0.098000 0.163699 0.942804 0.944179 0.943491 0.958879
2 0.104000 0.117573 0.955958 0.956947 0.956452 0.968497
3 0.071500 0.109042 0.961098 0.962062 0.961580 0.972209

And here’s the result on the test data set:

overall_precision overall_recall overall_f1 overall_accuracy
0.961755 0.962589 0.962172 0.972720

Wow, wow, that’s a 13-point jump in accuracy from the previous round: from 84.2% to 97.2%, beating the HMM model’s average accuracy of 91%!

This result was almost unbelievable given the still-modest training data set, so I went ahead and confirmed its performance using the same metric as for the HMM.

And now I believe it: min = 0.0, max = 1.0, mean = 0.95, median = 1.0

Unbelievably, its median accuracy is 100% (vs. the HMM’s 94%) and the average is 95% (vs. the HMM’s 91%).

This result once again confirms the superior power of deep learning models over traditional statistical methods when sufficient training data are available. In theory it’s not too hard to persuade oneself of this, but some concrete, practical results are necessary to really confirm it.

One last question that I wanted to ask was: can it do even better? What will happen if we keep training these models with more and more data?

Pushing more data

Below are 2 more training rounds with more data:

Training data size          Mean accuracy (test set)   Median accuracy (test set)
240k sentences (Round 2)    0.95                       1.0
960k sentences              0.97                       1.0
1200k sentences             0.97                       1.0

The performance did increase with more training data, but only marginally, and 97% seems to be about the best it can get with more data.

Better results are probably possible, but not with more data; rather, by redesigning the way we model this problem.

Some comparisons: HMM vs Transformer

Training data size

As mentioned in the table above, the smallest training data set that produced the model with 97% accuracy had about 1 million sentences. On average there are 10 words per sentence, so in total the training set has about 10 million word tokens.

When I reviewed the data set that was used to produce the bigram statistics for the HMM model, it contained about 13 million word tokens.

It’s interesting to see that, in this case, the Transformer model could outperform the HMM model on a comparable amount of training data.

Performance over long texts

To see if these models extend well to longer texts (>= 200 word tokens), I re-evaluated them on a test set of the same size, but with longer texts.

Over the longer texts, the Transformer model is surprisingly resilient, with its performance dropping only slightly:

Test data description                             Mean accuracy   Median accuracy
Random lengths (avg. ~10 word tokens per text)    0.97            1.0
Long texts (>= 200 word tokens per text)          0.93            0.94

In contrast, the HMM model suffered quite heavily, with its performance dropping to below 30%.

After some digging, the cause was found to be missing bigram statistics: the decoder runs into bigrams the model has never seen before, and longer texts exacerbate this problem.

Longer sequences are challenging for HMM decoding, even when smoothing has been applied. But this doesn’t seem to be an issue for neural network models.
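As an illustration of the kind of smoothing meant here (add-k smoothing is shown as a stand-in; the exact scheme used may differ), unseen bigrams get a small non-zero probability instead of a hard zero that would make every path through that bigram impossible:

```python
import math

def make_bigram_logprob(bigram_counts, unigram_counts, vocab_size, k=1.0):
    """Return an add-k smoothed log P(w2 | w1) function, so bigrams never seen
    in training still get a small non-zero probability."""
    def logprob(w1, w2):
        count = bigram_counts.get((w1, w2), 0)
        total = unigram_counts.get(w1, 0)
        return math.log((count + k) / (total + k * vocab_size))
    return logprob
```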

Inference speed

This is where the HMM model shines: from my experiments, its inference speed is 10–30 times faster than the Transformer model’s. On a text with ~10 word tokens, the HMM model needs only 0.01s to produce a result, compared to 0.1s–0.3s for the Transformer model.

The slower speed of neural network models in general is quite understandable, simply due to their massive number of parameters.

(Actually, what surprises me more is how we can run all of those matrix multiplications in such a short amount of time.)

Acknowledgements

I first learned about Huggingface’s transformer library via a recommendation by Cong Duy Vu Hoang earlier this year.

After playing around with the library, I was looking for an application on which I could experiment with the model fine-tuning process.

For that, the key ingredients were a large amount of labeled training data and access to GPUs, the latter enabled by Google Colab (and its Pro version, for running longer trainings).

One day, I stumbled upon this paper by Grammarly, in which a Transformer was used to do grammar correction via tagging. The paper made me think of this accent-marking problem that I had worked on before, and I realized I could formulate it as a tagging (token classification) problem as well!

Luckily, the training data necessary for this problem need no manual annotation: the labels are already in the natural texts. It makes for a perfect learning toy.

The pre-trained Vietnamese model I used was XLM-RoBERTa Large, and the training data were article titles from the Binhvq News corpus.

 

