PCMLBE: Parsed Corpus of Modern Lak

About the PCMLBE

The PCMLBE, the Parsed Corpus of Modern Lak, is a preliminary attempt at creating a growing corpus of syntactically annotated Lak texts. The annotation takes the parsed corpora of historical English, such as the PPCMBE, as a starting point and attempts to follow their annotation guidelines as closely as possible.

The Lak language belongs to the North-East Caucasian family; see the Ethnologue for details on the language. It is a head-final language, though less strictly so than the Turkic languages.

Annotation efforts

Efforts are under way to annotate a number of texts from different sources. The first text is a transcription of an oral tale.

The following steps are taken in the annotation process:

  1. Collection of journals and newspapers (Dagestan)
  2. Text selection (manual)
  3. Sentence segmentation and tokenization into Psdx (Cesax)
  4. Conversion from Psdx to FoLiA (Cesax)
  5. TODO: Indexing for CQL searches (BlackLab)
  6. TODO: Publication under WhiteLab + BlackLabServer
  7. TODO: Interlinearisation and morphological tagging
  8. TODO: additional steps
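
To give an impression of what the FoLiA output of step 4 can be used for, the following Python sketch reads sentences and tokens from a FoLiA document using only the standard library. It assumes the usual FoLiA layout (sentences as "s", words as "w", token text as "t", all in the namespace http://ilk.uvt.nl/folia). The sample document, its identifiers and its placeholder tokens are hypothetical and are not taken from the corpus; the actual files may carry additional annotation layers.

```python
import xml.etree.ElementTree as ET

# Namespace used by FoLiA documents.
FOLIA_NS = {"f": "http://ilk.uvt.nl/folia"}

# Hypothetical miniature FoLiA document. The tokens are placeholders,
# not real Lak data, and the identifiers are invented for illustration.
SAMPLE = """<FoLiA xmlns="http://ilk.uvt.nl/folia" xml:id="example" version="2.0">
  <text xml:id="example.text">
    <s xml:id="example.s.1">
      <w xml:id="example.s.1.w.1"><t>tok1</t></w>
      <w xml:id="example.s.1.w.2"><t>tok2</t></w>
    </s>
    <s xml:id="example.s.2">
      <w xml:id="example.s.2.w.1"><t>tok3</t></w>
    </s>
  </text>
</FoLiA>
"""


def sentences(xml_text):
    """Return a list of sentences, each a list of token strings."""
    root = ET.fromstring(xml_text)
    result = []
    # Sentences may be nested inside paragraphs or divisions, so search at any depth.
    for s in root.iterfind(".//f:s", FOLIA_NS):
        # Words are assumed to be direct children of their sentence.
        tokens = [
            w.findtext("f:t", default="", namespaces=FOLIA_NS)
            for w in s.iterfind("f:w", FOLIA_NS)
        ]
        result.append(tokens)
    return result


if __name__ == "__main__":
    for number, tokens in enumerate(sentences(SAMPLE), start=1):
        print(number, " ".join(tokens))
```

For a file on disk, the same function can be applied to the text read from that file; only the sample string needs to be replaced.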

Individual texts

A repository of syntactically annotated XML texts is available.







E.Komen@ru.nl | Last update: