About the PCMLBE
The PCMLBE, the Parsed Corpus of Modern Lak, is a preliminary attempt
at creating a growing corpus of syntactically annotated Lak texts.
The annotation takes the historical English parsed corpora, such as the
PPCMBE, as a starting point and attempts to follow the PPCMBE annotation
guidelines as closely as possible.
The Lak language belongs to the North-East Caucasian family.
See the Ethnologue for details on the language.
Lak is a head-final language, though not as strictly head-final as the Turkic languages.
Annotation efforts
Efforts are under way to annotate a number of texts from different sources.
The first text is a transcription of an oral tale.
The following steps are taken in the annotation process:
- Collection of journals and newspapers (Dagestan)
- Text selection (manual)
- Sentence segmentation and tokenization into Psdx (Cesax)
- Conversion from Psdx to FoLiA (Cesax)
- TODO: Indexing for CQL searches (BlackLab)
- TODO: Publication under WhiteLab + BlackLabServer
- TODO: Interlinearisation and morphological tagging
- TODO: Additional steps
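
To give a rough idea of what the sentence segmentation and tokenization step involves, here is a minimal Python sketch. It is only an illustration and not the procedure Cesax uses: the function names and the punctuation-based rules are assumptions made for this example, and the output is a plain list of token lists rather than Psdx.

```python
import re

# Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
# Real texts (abbreviations, quotations) need more careful rules.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

# A token is either a run of word characters or one punctuation mark.
# In Python 3, \w is Unicode-aware, so Cyrillic letters count as word characters.
TOKEN = re.compile(r"\w+|[^\w\s]")


def split_sentences(text: str) -> list[str]:
    """Split text on whitespace that follows sentence-final punctuation."""
    return [s for s in SENTENCE_END.split(text.strip()) if s]


def tokenize(sentence: str) -> list[str]:
    """Return word tokens and single punctuation marks in their original order."""
    return TOKEN.findall(sentence)


def segment(text: str) -> list[list[str]]:
    """Hypothetical helper: one list of tokens per sentence."""
    return [tokenize(s) for s in split_sentences(text)]


if __name__ == "__main__":
    # Placeholder English input; the corpus itself contains Lak text.
    for tokens in segment("One tale. Was it told?"):
        print(tokens)
```

Each inner list corresponds to one sentence, which is the unit that would then receive a syntactic annotation in the later steps.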