IMP Slovene Corpora - IMPACT Centre of Competence

ABSTRACT

The reference corpus of historical Slovene, goo300k, contains the text of 1,100 pages sampled from the IMP collection, with hand-validated linguistic annotation. The current version of the corpus is documented in its TEI header. The larger IMP corpus contains the complete IMP text collection and is annotated automatically. Both corpora can be searched on nl.ijs.si (as part of the “Slovene reference corpora” group) with two concordancers:

NoSketch Engine
the open-source version of the commercial Sketch Engine
CUWI
our interface to the IMS CWB backend, for the more adventurous.

Each word token (e.g. “lubesni”) in the corpora is annotated with:

modernised form (“ljubezni”);
lemma (“ljubezen”);
PoS tag (“Ncm”); the tagset is defined in the IMP morphosyntactic specifications.
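
The per-token annotation listed above can be pictured as a small record. The following is a minimal sketch in Python, not the corpus's own data model: the class name `Token`, its field names and the helper `is_respelled` are illustrative assumptions, and the values are the example given in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One annotated word token (field names are hypothetical, chosen for illustration)."""
    form: str        # original historical spelling, e.g. "lubesni"
    modernised: str  # modernised word form, e.g. "ljubezni"
    lemma: str       # lemma of the modernised form, e.g. "ljubezen"
    pos: str         # PoS tag from the IMP morphosyntactic specifications


# The example token from the text above.
example = Token(form="lubesni", modernised="ljubezni", lemma="ljubezen", pos="Ncm")


def is_respelled(token: Token) -> bool:
    """Return True when the historical form differs from its modernised form."""
    return token.form.lower() != token.modernised.lower()
```

Keeping the historical and modernised forms side by side is what makes it possible to search for a modern spelling and still retrieve the historical variants.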

The IMP XML schema for the corpora is based on the TEI P5 Guidelines.
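
As a rough illustration of reading token annotation from TEI-style XML, the sketch below uses Python's standard library. Only the TEI namespace URI is taken from TEI P5; the sample fragment, the attribute names `nform`, `lemma` and `ana`, and the function `read_tokens` are assumptions made for this example and should be checked against the actual IMP schema and the corpus TEI header.

```python
import xml.etree.ElementTree as ET

# The TEI P5 namespace; element lookups must be namespace-qualified.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical fragment: the attribute names below are assumptions,
# not confirmed names from the IMP schema.
SAMPLE = """<s xmlns="http://www.tei-c.org/ns/1.0">
  <w nform="ljubezni" lemma="ljubezen" ana="Ncm">lubesni</w>
</s>"""


def read_tokens(xml_text):
    """Collect the annotation of every <w> element as a list of dicts."""
    root = ET.fromstring(xml_text)
    return [
        {
            "form": w.text,                # historical word form (element content)
            "modernised": w.get("nform"),  # assumed attribute name
            "lemma": w.get("lemma"),       # assumed attribute name
            "pos": w.get("ana"),           # assumed attribute name
        }
        for w in root.iterfind(".//tei:w", TEI_NS)
    ]
```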

PUBLICATIONS

Tomaž Erjavec. 2012. The goo300k corpus of historical Slovene. Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), Istanbul.

PRODUCED BY

Jozef Stefan Institute

LICENSING

The corpora are available in the source XML encoding and in a derived tabular (vertical) format, whose files are smaller and easier to process but contain less information. Both can be obtained at http://nl.ijs.si/imp/index-en.html#corpus
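
A vertical file of the kind used by CWB-style tools typically has one token per line with tab-separated annotation columns, and structural markup on separate lines starting with `<`. The sketch below parses such text in Python. The column order in `COLUMNS`, the sample in `SAMPLE_VERT` and the function `parse_vertical` are assumptions for illustration; the real column layout should be taken from the corpus documentation.

```python
# Assumed column order; the actual IMP vertical files may differ.
COLUMNS = ("form", "modernised", "lemma", "pos")

# Hypothetical sample: one sentence containing the example token from the text.
SAMPLE_VERT = "<s>\nlubesni\tljubezni\tljubezen\tNcm\n</s>\n"


def parse_vertical(text):
    """Return one dict per token line, skipping blank lines and structural tags."""
    tokens = []
    for line in text.splitlines():
        # Lines starting with "<" are treated as structure (e.g. <s>, </s>),
        # which assumes literal "<" tokens are escaped in the file.
        if not line.strip() or line.startswith("<"):
            continue
        fields = line.split("\t")
        tokens.append(dict(zip(COLUMNS, fields)))
    return tokens
```

Because each token occupies a single line, the format can be processed as a stream without an XML parser, which is the trade-off the text describes: easier processing in exchange for less information.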