| Next revision | Previous revision | ||
|
language_modeling [2015/03/31 15:15] mganzeboom created |
language_modeling [2015/03/31 16:40] (current) mganzeboom |
||
|---|---|---|---|
| Line 21: | Line 21: | ||
| Assuming that a SPRAAK lexicon file has its lines in the format ' | Assuming that a SPRAAK lexicon file has its lines in the format ' | ||
| - Remove the lines up to and including the line that only contains the '#' | - Remove the lines up to and including the line that only contains the '#' | ||
| - | - Run the following sed command in a Linux terminal to remove the phonetic transcription column: '' | + | - Run the following sed command in a Linux terminal to remove the phonetic transcription column:\\ '' |
| The resulting file will contain only one word per line and no header, just as required by SRI LM. | The resulting file will contain only one word per line and no header, just as required by SRI LM. | ||
| Line 30: | Line 30: | ||
| ==== Commands to create a simple unsmoothed bigram text model ==== | ==== Commands to create a simple unsmoothed bigram text model ==== | ||
| Also see this short and practical tutorial as part of a Linguistics course at UC San Diego [[http:// | Also see this short and practical tutorial as part of a Linguistics course at UC San Diego [[http:// | ||
| - | When you have your vocabulary and corpus text files ready, the following command from SRI LM Toolkit will create a bigram language model and store it in binary | + | When you have your vocabulary and corpus text files ready, the following command from SRI LM Toolkit will create a bigram language model and store it in the ARPA backoff N-gram |
| - | '' | + | '' |
| For an explanation of this command and the options used, please refer to the above tutorial or the [[http:// | For an explanation of this command and the options used, please refer to the above tutorial or the [[http:// | ||
| + | |||
| + | To convert this text format to the binary format required by the SPRAAK ASR Toolkit, execute the following command (assuming you have the toolkit on the PATH):\\ | ||
| + | '' | ||
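
The lexicon clean-up described in this page (remove everything up to and including the line that contains only '#', then remove the phonetic transcription column) can also be sketched in Python. This is a minimal sketch, not the sed command from the page. It assumes each entry line holds the word first, followed by whitespace and the transcription; that layout is an assumption, because the exact lexicon line format is not shown here. The function name `lexicon_to_vocab` is hypothetical.

```python
def lexicon_to_vocab(lines):
    """Return the words from SPRAAK-style lexicon lines, one per entry.

    Assumptions (not confirmed by the page): the header ends with a line
    that contains only '#', and every entry line has the form
    '<word><whitespace><phonetic transcription>'.
    """
    words = []
    in_header = True
    for line in lines:
        stripped = line.strip()
        if in_header:
            # Skip everything up to and including the lone '#' line.
            if stripped == "#":
                in_header = False
            continue
        if not stripped:
            continue  # ignore blank lines
        # Keep only the first whitespace-separated field (the word).
        words.append(stripped.split(None, 1)[0])
    return words


if __name__ == "__main__":
    sample = [
        "[header line]",
        "#",
        "hallo\th A l o",
        "wereld\tw e r @ l t",
    ]
    print("\n".join(lexicon_to_vocab(sample)))
```

If the input has no line consisting solely of '#', this sketch returns an empty list, because it never leaves the header.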
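
For intuition about what an unsmoothed bigram model contains, the sketch below computes relative-frequency estimates P(w2 | w1) = count(w1 w2) / count(w1) from a small tokenized corpus. It is a plain-Python illustration and not the SRI LM implementation. It assumes one sentence per line with whitespace-separated tokens, and it adds `<s>` and `</s>` sentence-boundary tokens. The function name `bigram_mle` is hypothetical.

```python
from collections import Counter


def bigram_mle(sentences):
    """Return unsmoothed (maximum-likelihood) bigram probabilities.

    The result maps (w1, w2) to count(w1 w2) / count(w1 as a history).
    Unseen bigrams are simply absent, i.e. they get no probability mass.
    """
    history_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        # Assumption: whitespace tokenization, one sentence per string.
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for w1, w2 in zip(tokens, tokens[1:]):
            history_counts[w1] += 1
            bigram_counts[(w1, w2)] += 1
    return {
        bigram: count / history_counts[bigram[0]]
        for bigram, count in bigram_counts.items()
    }


if __name__ == "__main__":
    for bigram, prob in sorted(bigram_mle(["a b", "a c"]).items()):
        print(bigram, prob)
```

Because nothing is smoothed, any word pair that does not occur in the corpus has no entry at all, which is why smoothing or backoff is normally used with real data.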