Language modeling (for ASR)

This section collects links, tips, tricks and tutorials on language modeling for automatic speech recognition (ASR). They were written primarily for the projects within the PI group LS-LT, but they could prove useful in other contexts too.

SRI LM Toolkit

The SRI LM Toolkit is one of the standard software packages for language modeling. It is free for non-profit use (e.g. academic research).
Official website: http://www.speech.sri.com/projects/srilm
Location of v1.5.6 (sources and binaries, compiled for the i686 32-bit architecture): /home/magam/mnt/smurf_home/t0/people/huijbreg/src-srilm/srilm-1.5.6
Location v1.7.1 compiled for Linux i686 64 bit architecture: tenslogin.let.kun.nl:/RAID/t0/people/mganzeboom/programs/srilm-1.7.1-src-n-bin

Compiling and installing

On a Linux-like machine, compiling the SRI LM Toolkit is straightforward and takes about 10 minutes or less. See the INSTALL file in the main directory of either source tree listed above.

Resource files and formats

For a simple case, such as an unsmoothed bigram language model, SRI LM works with two input files: a vocabulary text file (e.g. a lexicon/dictionary in ASR terms) and a corpus text file.
The vocabulary file is not strictly required, but it is good practice to supply one so that the final language model is limited to the words in the vocabulary. It should contain one unique word per line; words are case-sensitive.
The corpus text file contains the text from which a language model should be created. Every line should contain one utterance.
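
As a minimal illustration of the two formats described above, the sketch below writes a toy vocabulary and a toy corpus and runs two quick sanity checks on them. The file names and their contents are made up for this example.

```shell
# Toy vocabulary: one unique, case-sensitive word per line (contents are made up).
cat > toy.vocab <<'EOF'
de
kat
hond
slaapt
EOF

# Toy corpus: one utterance per line (contents are made up).
cat > toy.corpus <<'EOF'
de kat slaapt
de hond slaapt
EOF

# Print every word that occurs more than once in the vocabulary.
# No output means all entries are unique.
sort toy.vocab | uniq -d

# Count the utterances (lines) in the corpus.
wc -l < toy.corpus
```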

Converting resources from SPRAAK

You probably already have some resource files available from other ASR software or projects, for instance resources used with the SPRAAK ASR toolkit.

Assuming that a SPRAAK lexicon file has its lines in the format 'bezinken [b@zInk@/b@zInk@n]', the file can be easily converted to a SRI LM vocabulary file by doing the following:

  1. Remove the lines up to and including the line that only contains the '#' character (i.e. the SPRAAK lexicon file header).
  2. Use this command on a Linux terminal calling the sed tool to remove the phonetic transcription column:
    sed -e 's/\[.*\]//g' <spraak_lexicon_file> > <SRI_LM_toolkit_vocabulary_file>

The resulting file will contain only one word per line and no header, just as required by SRI LM.
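
Both steps can also be combined into a single sed call. The sketch below is one possible way to do this, not the only one. It assumes that the header has at least one line before the line containing only '#'. The file names, the header placeholder and the second lexicon entry are made up for illustration. Unlike the command above, the pattern also removes the whitespace in front of the transcription, so no trailing spaces are left behind.

```shell
# Made-up sample lexicon: a placeholder header, the '#' separator, then the entries.
cat > sample_spraak.lex <<'EOF'
HEADER_PLACEHOLDER some_value
#
bezinken [b@zInk@/b@zInk@n]
bezoek [b@zuk]
EOF

# First expression: delete everything from line 1 up to and including
# the first line that consists of '#' only (the header).
# Second expression: remove the bracketed transcription and the
# whitespace in front of it.
sed -e '1,/^#$/d' -e 's/[[:space:]]*\[.*\]//' sample_spraak.lex > sample_srilm.vocab

cat sample_srilm.vocab
```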

The corpus text file should contain one utterance (e.g. a word or a sentence) per line. The Linux tools sort and uniq come in handy for producing a corpus with distinct lines of text. For example, the command 'sort <src_text_file> | uniq > distinct-corpus-file.txt' sorts all lines of text in the source file, merges duplicates with 'uniq' and writes the result to 'distinct-corpus-file.txt'.

You then have the proper corpus text and vocabulary files to start creating a language model.

Commands to create a simple unsmoothed bigram text model

For some notes on smoothed and unsmoothed models, also see the short and practical tutorial that is part of a Linguistics course at UC San Diego.
When your vocabulary and corpus text files are ready, the following SRI LM Toolkit command creates a bigram language model and stores it in the ARPA backoff N-gram format. A model in this format can be converted to the SPRAAK binary format with the spr_lm_arpabo utility.

ngram-count -vocab <path_to_vocab_file> -text <path_to_corpus_file> -order <max_length_n-grams_(2_for_bigrams)> -addsmooth <0-9_add_smoothing_of_lm_0_for_none> -lm <path_to_store_n-gram_model_in_n-gram_text_format>

For an explanation of this command and the options used, please refer to the above tutorial or the man page.
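
As a concrete instance of the command above, the sketch below fills in the placeholders for an unsmoothed bigram model. The file names and the toy data are made up for illustration; the options are the ones already shown above. The command is only run when ngram-count is found on the PATH.

```shell
# Made-up toy input files, only so that the example is self-contained.
printf '%s\n' de kat hond slaapt > demo.vocab
printf '%s\n' 'de kat slaapt' 'de hond slaapt' > demo.corpus

# -order 2 requests n-grams up to length 2 (bigrams);
# -addsmooth 0 requests no additive smoothing, as described above.
if command -v ngram-count >/dev/null 2>&1; then
    ngram-count -vocab demo.vocab -text demo.corpus \
        -order 2 -addsmooth 0 -lm demo.bigram.arpa
else
    echo "ngram-count not found on PATH; skipping" >&2
fi
```

When the command runs, demo.bigram.arpa is a plain text file, so it can be inspected with any text viewer before converting it further.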

To convert this text format to the binary format required by the SPRAAK ASR Toolkit, execute the following command (assuming you have the toolkit on the PATH):
spr_lm_arpabo -i <path_to_sri-lm_in_text_format> -o <path_to_store_SPRAAK_binary_lm>