Evaluation of Forced Aligners

This page collects information on forced alignment tools, both those available here at CLST and those from third parties, and compares them. It is not a formal evaluation; it reflects the experiences and intuitions of the people who have worked with these tools.

Montreal Forced Aligner

Montreal Forced Aligner (MFA): https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner
“MFA is quite user-friendly and easy to use. There are several options, you can use the pre-trained models or you can train the models on your own data/speakers. Attached you will find some examples of why I am not too happy with MFA (especially with the pretrained models), but I am not quite sure how to get a better system. It is especially hard because some of the speakers reduce and the dictionary is not equipped to handle reduction. I have not formally evaluated MFA yet, so I cannot make a general statement about its performance. However looking at just a few files, it seems like that training on the speaker yields better results.” (Katherine Marcoux)
By default, MFA trains its acoustic models up to the stage of speaker-adapted triphones (HMM-GMM). The pretrained models on the MFA website also appear to be HMM-GMM. I assume Katherine uses these pretrained models and trains speaker-adapted models with the pretrained ones as a basis?
MFA also includes code to train DNN-based acoustic models (Kaldi's NNet2), but the documentation cautions: “The DNN framework for the Montreal Forced Aligner is operational, but may not give a better result than the alignments produced by the standard HMM-GMM pipeline. Preliminary experiments suggest that results may improve when the DNN model used to produce alignments is pre-trained on a corpus similar in quality (conversational vs. clean speech) and longer in length than the test corpus.” Source: https://montreal-forced-aligner.readthedocs.io/en/latest/alignment_techniques.html#deep-neural-networks-dnns (Mario Ganzeboom)
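One possible way to work around the reduction problem mentioned above is to add reduced pronunciation variants to the dictionary, so the aligner can pick the variant that best fits the audio. The sketch below is only an illustration: it assumes a plain-text lexicon with one pronunciation per line (word, then whitespace-separated phones), and the words, phone symbols and function names are hypothetical examples, not taken from any dictionary used here.

```python
def add_variants(lexicon, variants):
    """Return a new lexicon with extra pronunciations appended.

    Both arguments map a word to a list of pronunciations, where each
    pronunciation is a tuple of phone symbols. Duplicates are skipped.
    """
    merged = {word: list(prons) for word, prons in lexicon.items()}
    for word, prons in variants.items():
        existing = merged.setdefault(word, [])
        for pron in prons:
            if pron not in existing:
                existing.append(pron)
    return merged


def format_lexicon(lexicon):
    """Serialise the lexicon as text: one 'word<TAB>phones' line per pronunciation."""
    lines = []
    for word in sorted(lexicon):
        for pron in lexicon[word]:
            lines.append(word + "\t" + " ".join(pron))
    return "\n".join(lines) + "\n"


# Hypothetical canonical entries (ARPAbet-like symbols, for illustration only).
canonical = {
    "probably": [("P", "R", "AA", "B", "AH", "B", "L", "IY")],
    "and": [("AE", "N", "D")],
}

# Hypothetical reduced variants as they might occur in conversational speech.
reduced = {
    "probably": [("P", "R", "AA", "B", "L", "IY")],
    "and": [("AH", "N"), ("AE", "N", "D")],  # the second one is a duplicate and is skipped
}

extended = add_variants(canonical, reduced)
lexicon_text = format_lexicon(extended)
```

Whether extra variants actually improve the alignments would still have to be checked on the data: more variants also give the aligner more room to choose a wrong one.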

Kaldi vs. HTK

Kaldi ASR toolkit: http://kaldi-asr.org
HTK ASR toolkit: http://htk.eng.cam.ac.uk
“KALDI outperforms HTK, in general, but the output quality largely depends on the adequacy of the orthographic input.” (Louis ten Bosch, Katherine Marcoux)
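If the impressions on this page ever need to be backed by numbers, one common approach is to compare the boundaries produced by an aligner with manually corrected boundaries for the same segments. The sketch below is a minimal illustration under stated assumptions: boundary times are in seconds, both lists contain the same segments in the same order, and the 20 ms tolerance is an assumed threshold, not a value used in any evaluation reported here. The function name and the example numbers are hypothetical.

```python
def boundary_agreement(reference, hypothesis, tolerance=0.020):
    """Compare two equally long lists of boundary times (in seconds).

    Returns a tuple (fraction of boundaries within `tolerance`,
    mean absolute boundary difference in seconds).
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")
    if not reference:
        raise ValueError("at least one boundary is required")
    diffs = [abs(r - h) for r, h in zip(reference, hypothesis)]
    within = sum(1 for d in diffs if d <= tolerance)
    return within / len(diffs), sum(diffs) / len(diffs)


# Hypothetical boundaries for one utterance: manual reference vs. aligner output.
manual = [0.10, 0.50, 0.90]
aligned = [0.11, 0.53, 0.90]

fraction_within, mean_abs_diff = boundary_agreement(manual, aligned)
```

Running the same comparison for each aligner on the same files would make statements such as "A outperforms B" easier to check, provided the orthographic input is identical for both.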