Differences

This shows you the differences between two versions of the page.

--- language_modeling [2015/03/31 15:15]
mganzeboom created
+++ language_modeling [2015/03/31 16:40] (current)
mganzeboom
@@ Line 21: / Line 21: @@
 Assuming that a SPRAAK lexicon file has its lines in the format '**bezinken	[b@zInk@/b@zInk@n]**', the file can be easily converted to a SRI LM vocabulary file by doing the following:
   - Remove the lines up to and including the line that only contains the '#' character (i.e. the SPRAAK lexicon file header).
-  - Use this command on a Linux terminal calling the sed tool to remove the phonetic transcription column: ''sed -e 's/\[.*\]//g' <spraak_lexicon_file> > <SRI_LM_toolkit_vocabulary_file>''
+  - Use this command on a Linux terminal calling the sed tool to remove the phonetic transcription column:\\ ''sed -e 's/\[.*\]//g' <spraak_lexicon_file> > <SRI_LM_toolkit_vocabulary_file>''
 The resulting file will contain only one word per line and no header, just as required by SRI LM.
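The two steps above can also be combined into a single sed call. The following is a minimal sketch, not part of the original page: the file names ''lexicon.txt'' and ''vocab.txt'', the header text and the second lexicon entry are invented for illustration. It assumes the header ends with a line containing only '#' and that this line is not the first line of the file. Unlike the command above, it also strips the whitespace between the word and the transcription.

```shell
# Hypothetical sample lexicon: a header line, a lone '#', then
# word<TAB>[transcription] entries. Only the 'bezinken' entry comes from
# the page; the header text and the 'lopen' entry are made up.
printf '%s\n' 'HEADER (contents assumed)' '#' > lexicon.txt
printf 'bezinken\t[b@zInk@/b@zInk@n]\n' >> lexicon.txt
printf 'lopen\t[lop@n]\n' >> lexicon.txt

# 1,/^#$/d deletes everything up to and including the line that is only '#'.
# The substitution removes the separating whitespace plus the bracketed
# transcription, leaving one word per line.
sed -e '1,/^#$/d' -e 's/[[:space:]]*\[.*\]//' lexicon.txt > vocab.txt

cat vocab.txt
# prints:
# bezinken
# lopen
```

Removing the leading whitespace together with the brackets avoids trailing tabs in the vocabulary file, which is a design choice of this sketch rather than something the page requires.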
@@ Line 30: / Line 30: @@
 ==== Commands to create a simple unsmoothed bigram text model ====
 Also see this short and practical tutorial, part of a Linguistics course at UC San Diego, [[http://idiom.ucsd.edu/~rlevy/teaching/2015winter/lign165/lectures/lecture13/lecture13_ngrams_with_SRILM.pdf|here]] for some notes on smoothed and unsmoothed models.\\
-When you have your vocabulary and corpus text files ready, the following command from SRI LM Toolkit will create a bigram language model and store it in binary format. This format can be used in SPRAAK.
+When you have your vocabulary and corpus text files ready, the following command from the SRI LM Toolkit will create a bigram language model and store it in the ARPA backoff N-gram format. A model in this format can be converted to the SPRAAK binary format with the [[http://www.spraak.org/documentation/doxygen/doc/html/spr__lm__arpabo_8c.html|spr_lm_arpabo]] utility.
-''ngram-count -text corpus.txt -order 2 -addsmooth 0 -lm corpus.lm''
+''ngram-count -vocab <path_to_vocab_file> -text <path_to_corpus_file> -order <max_length_n-grams_(2_for_bigrams)> -addsmooth <0-9_add_smoothing_of_lm_0_for_none> -lm <path_to_store_n-gram_model_in_n-gram_text_format>''
 For an explanation of this command and the options used, please refer to the above tutorial or the [[http://www.speech.sri.com/projects/srilm/manpages/ngram-count.1.html|man page]].
+To convert this text format to the binary format required by the SPRAAK ASR Toolkit, execute the following command (assuming you have the toolkit on the PATH):\\
+''spr_lm_arpabo -i <path_to_sri-lm_in_text_format> -o <path_to_store_SPRAAK_binary_lm>''
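The two commands can be chained in a small script. This is only a sketch: the file names are placeholders chosen for illustration, the options are the ones shown above with ''-order 2'' and ''-addsmooth 0'' filled in for an unsmoothed bigram model, and the script assumes both tools may or may not be on the PATH.

```shell
#!/bin/sh
# Placeholder file names (assumptions, not taken from the toolkits).
VOCAB=vocab.txt
CORPUS=corpus.txt
ARPA_LM=corpus.arpa.lm
SPRAAK_LM=corpus.spraak.lm

build_lm() {
    # Step 1: unsmoothed bigram model in ARPA text format (SRI LM Toolkit).
    ngram-count -vocab "$VOCAB" -text "$CORPUS" -order 2 -addsmooth 0 \
        -lm "$ARPA_LM" || return 1
    # Step 2: convert the ARPA text model to the SPRAAK binary format.
    spr_lm_arpabo -i "$ARPA_LM" -o "$SPRAAK_LM"
}

# Only run the pipeline when both tools can be found on the PATH.
if command -v ngram-count >/dev/null 2>&1 &&
   command -v spr_lm_arpabo >/dev/null 2>&1; then
    build_lm
    STATUS=$?
else
    echo "ngram-count or spr_lm_arpabo not found on PATH; nothing done" >&2
    STATUS=127
fi
```

''STATUS'' holds the exit status of the last step that ran, or 127 when a tool is missing, so a calling script can decide whether to continue.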
