About the PCML
The PCML, the Parsed Corpus of Modern Lezgi, is a preliminary attempt
at creating a growing corpus of syntactically annotated Lezgi texts.
The annotation takes the historical English parsed corpora, such as the
PPCMBE, as a starting point and attempts to follow their annotation
guidelines as closely as possible.
The Lezgi language belongs to the North-East Caucasian family.
See the Ethnologue for details on the language.
It is a head-final language, though less strictly so than the Turkic languages.
Annotation efforts
Efforts are under way to annotate a number of texts from different sources.
The first text is a transcription of an oral tale.
The following steps are taken in the annotation process:
- Breaking up into sentences (FLEX)
- Tokenization (FLEX)
- Interlinearisation and morphological tagging (FLEX)
- Transformation from FLEX to FoLiA (automatically using Cesax)
- Transformation from FoLiA to Psdx (automatically using Cesax)
- Dependency parsing (using MaltParser trained on a related language)
- Dependency-to-constituency conversion (done within Cesax)
- Constituent parse correction (manual process within Cesax)
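
To give a rough idea of what the dependency-to-constituency step involves, the sketch below projects a phrase from every token that has dependents and keeps the surface word order. It is only an illustration under stated assumptions, not the conversion that Cesax performs: the `Token` class, the `to_brackets` function, the placeholder word forms (`w1`, `w2`, `w3`) and the naive "tag + P" phrase labels are all hypothetical, and the labels do not follow the PPCMBE guidelines.

```python
from dataclasses import dataclass


@dataclass
class Token:
    idx: int    # 1-based position in the sentence
    form: str   # surface form (placeholders here, not real Lezgi data)
    pos: str    # part-of-speech tag
    head: int   # position of the governing token, 0 for the root


def to_brackets(tokens):
    """Turn a dependency analysis into a labelled bracketing.

    Assumption: every token with dependents projects one phrase whose
    label is its POS tag plus "P"; tokens without dependents stay leaves.
    """
    # Group the tokens by the position of their head.
    children = {}
    for tok in tokens:
        children.setdefault(tok.head, []).append(tok)

    def build(tok):
        leaf = f"({tok.pos} {tok.form})"
        deps = children.get(tok.idx, [])
        if not deps:
            return leaf
        # Keep surface order: sort the head and its dependents by position.
        parts = sorted(deps + [tok], key=lambda t: t.idx)
        inner = " ".join(leaf if p is tok else build(p) for p in parts)
        return f"({tok.pos}P {inner})"

    return " ".join(build(root) for root in children.get(0, []))


# A head-final toy sentence: modifier, noun, clause-final verb.
example = [
    Token(1, "w1", "ADJ", 2),
    Token(2, "w2", "N", 3),
    Token(3, "w3", "V", 0),
]
print(to_brackets(example))
# prints: (VP (NP (ADJ w1) (N w2)) (V w3))
```

A conversion of this kind yields only a first approximation of the constituent structure, which is why a manual correction step follows it in the list above.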