Nijmegen parsed corpus of modern Chechen

NPCMC: Nijmegen Parsed Corpus of Modern Chechen

About the NPCMC

The NPCMC, the Nijmegen Parsed Corpus of Modern Chechen, is a preliminary attempt at creating a growing corpus of syntactically annotated Chechen texts. The annotation takes the parsed corpora of historical English, such as the PPCMBE, as a starting point and attempts to follow their annotation guidelines as closely as possible.

Chechen is a major representative of the North-East Caucasian language family and is agglutinative. It has over sixteen grammatical cases, more than half of which express locational and directional distinctions for which other languages would use prepositions or descriptive phrases.

Annotation efforts

Efforts are under way to annotate a number of texts from different sources. The first source is a set of newspaper and journal articles collected by New Mexico State University around 2005-2006. The second source consists of texts gleaned from various books and other places. Each text identifies its source.

The following steps are taken in the annotation process:

Breaking up into sentences (own software)

Tokenization (own software)

Part-of-speech estimation (uses extended Maciev dictionary combined with MBT)

Part-of-speech correction (manual process within Cesax)

Dependency parsing (uses MaltParser)

Dependency-to-constituency conversion (done within Cesax)

Constituent parse correction (manual process within Cesax)
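
The first two steps above, sentence splitting and tokenization, can be illustrated with a minimal sketch. This is not the project's own software, which is not described here; the function names, the sentence-boundary rule and the token pattern are all assumptions made for illustration only. Python's `\w` is Unicode-aware, so Cyrillic letters, including the palochka used in Chechen orthography, are treated as word characters.

```python
import re

# Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
# Real text (abbreviations, quotations, initials) needs more careful rules.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")

# Assumption: a token is either a run of word characters, optionally joined
# by internal hyphens, or a single non-space punctuation character.
_TOKEN = re.compile(r"\w+(?:-\w+)*|[^\w\s]")


def split_sentences(text):
    """Split running text into sentences at the assumed boundaries."""
    return [s for s in _SENTENCE_BOUNDARY.split(text.strip()) if s]


def tokenize(sentence):
    """Split one sentence into word and punctuation tokens."""
    return _TOKEN.findall(sentence)


if __name__ == "__main__":
    # Placeholder text, not taken from the corpus.
    sample = "One two. Three four? Five!"
    for sentence in split_sentences(sample):
        print(tokenize(sentence))
```

The output of such a step, one list of tokens per sentence, is the kind of input a part-of-speech tagger would then work on.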

Individual texts

A repository of syntactically annotated XML texts is available. Most effort currently goes into the part-of-speech (POS) annotation of texts, since each increase in 'gold' standard annotated texts gives the MBT tagger a larger training set, and this, in turn, should lead to better performance when tagging new texts off the shelf. So do have a look at the repository of POS-tagged XML texts.

Annotators

Two annotators have been involved so far.

References

If you use the NPCMC for your own work, please cite the following paper:

authors (to appear). Constructing a corpus of modern Chechen.

E.Komen@ru.nl