language_modeling

Differences

This shows you the differences between two versions of the page.

language_modeling [2015/03/31 15:15]
mganzeboom created
language_modeling [2015/03/31 16:40] (current)
mganzeboom
Line 21: Line 21:
 Assuming that a SPRAAK lexicon file has its lines in the format '**bezinken [b@zInk@/b@zInk@n]**', the file can easily be converted to an SRI LM vocabulary file as follows:
   - Remove all lines up to and including the line that contains only the '#' character (i.e. the SPRAAK lexicon file header).
-  - Use this command on a Linux terminal calling the sed tool to remove the phonetic transcription column: ''sed -e 's/\[.*\]//g' <spraak_lexicon_file> > <SRI_LM_toolkit_vocabulary_file>''
+  - Use this command on a Linux terminal calling the sed tool to remove the phonetic transcription column:\\ ''sed -e 's/\[.*\]//g' <spraak_lexicon_file> > <SRI_LM_toolkit_vocabulary_file>''
 The resulting file will contain only one word per line and no header, as required by SRI LM.
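
The two conversion steps can also be sketched as one pipeline. This is only a minimal sketch under stated assumptions: the sample lexicon (its header line, the second entry and that entry's transcription) and all file names are made up for illustration, and ''awk'' is used here in place of removing the header by hand.

```shell
# Hypothetical sample lexicon: one invented header line, the '#' separator,
# then entries in the 'word [transcription]' format described above.
tmpdir=$(mktemp -d)
printf '%s\n' \
    'EXAMPLE HEADER LINE' \
    '#' \
    'bezinken [b@zInk@/b@zInk@n]' \
    'bezit [b@zIt]' > "$tmpdir/lexicon.txt"

# Step 1 (awk): print only the lines that follow the first line consisting of '#'.
# Step 2 (sed): drop the bracketed transcription and the whitespace before it.
awk 'seen { print } /^#$/ { seen = 1 }' "$tmpdir/lexicon.txt" \
    | sed -e 's/[[:space:]]*\[.*\]//' > "$tmpdir/vocab.txt"

# Prints the two words, one per line: bezinken, bezit
cat "$tmpdir/vocab.txt"
```

Unlike the command in the list above, the ''sed'' expression in this sketch also strips the space in front of the bracket, so the vocabulary lines carry no trailing whitespace.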
  
Line 30: Line 30:
 ==== Commands to create a simple unsmoothed bigram text model ====
 Also see [[http://idiom.ucsd.edu/~rlevy/teaching/2015winter/lign165/lectures/lecture13/lecture13_ngrams_with_SRILM.pdf|this short, practical tutorial]] from a Linguistics course at UC San Diego for some notes on smoothed and unsmoothed models.\\
-When you have your vocabulary and corpus text files ready, the following command from SRI LM Toolkit will create a bigram language model and store it in binary format. This format can be used in SPRAAK.
+When you have your vocabulary and corpus text files ready, the following command from the SRI LM Toolkit will create a bigram language model and store it in the ARPA backoff N-gram format. This format can be converted to the SPRAAK binary format with the [[http://www.spraak.org/documentation/doxygen/doc/html/spr__lm__arpabo_8c.html|spr_lm_arpabo]] utility.
  
-''ngram-count -text corpus.txt -order -addsmooth 0 -lm corpus.lm''
+''ngram-count -vocab <path_to_vocab_file> -text <path_to_corpus_file> -order <max_length_n-grams_(2_for_bigrams)> -addsmooth <0-9_add_smoothing_of_lm_0_for_none> -lm <path_to_store_n-gram_model_in_n-gram_text_format>''
  
 For an explanation of this command and its options, please refer to the above tutorial or the [[http://www.speech.sri.com/projects/srilm/manpages/ngram-count.1.html|man page]].
+
+To convert this text format to the binary format required by the SPRAAK ASR Toolkit, execute the following command (assuming the toolkit is on your PATH):\\
+''spr_lm_arpabo -i <path_to_sri-lm_in_text_format> -o <path_to_store_SPRAAK_binary_lm>''
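
As an illustration, the placeholder commands on this page might be filled in as follows for an unsmoothed bigram model. This is a sketch under stated assumptions rather than a verified recipe: all four file names are hypothetical, and the option sets are limited to the ones named on this page. Each step is skipped when its tool or input file is not available.

```shell
# Hypothetical file names; replace them with your own paths.
vocab=vocab.txt                 # one word per line, no header
corpus=corpus.txt               # training text, one sentence per line
arpa_lm=corpus.bigram.arpa      # ARPA backoff N-gram text format
spraak_lm=corpus.bigram.lm      # SPRAAK binary format

# Step 1: unsmoothed bigram model (-order 2, -addsmooth 0) in ARPA text format.
if command -v ngram-count >/dev/null 2>&1 && [ -r "$vocab" ] && [ -r "$corpus" ]; then
    ngram-count -vocab "$vocab" -text "$corpus" -order 2 -addsmooth 0 -lm "$arpa_lm"
else
    echo "Skipping ngram-count: tool or input files not found." >&2
fi

# Step 2: convert the ARPA text model to the SPRAAK binary format.
if command -v spr_lm_arpabo >/dev/null 2>&1 && [ -r "$arpa_lm" ]; then
    spr_lm_arpabo -i "$arpa_lm" -o "$spraak_lm"
else
    echo "Skipping spr_lm_arpabo: tool or ARPA model not found." >&2
fi
```

Keeping the ARPA text file next to the binary one makes it possible to inspect the n-gram counts and probabilities by eye before loading the model in SPRAAK.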
  
language_modeling.1427807753.txt.gz · Last modified: 2015/03/31 15:15 by mganzeboom