IMP Slovene Corpora


The goo300k reference corpus of historical Slovene contains the text of 1,100 pages sampled from the IMP collection, with hand-validated linguistic annotation. The current version of the corpus is documented in its TEI header. The larger IMP corpus contains the complete IMP text collection and is annotated automatically. The goo300k and IMP corpora can be searched (as part of the “Slovene reference corpora” group) with two concordancers:

Each word token (e.g. “lubesni”) in the corpora is annotated with:

The IMP XML schema for the corpora is based on the TEI P5 Guidelines.
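
As a rough illustration of what working with TEI-encoded word tokens can look like, the sketch below reads a small TEI P5 fragment and lists its `<w>` (word) elements together with whatever attributes they carry. Only the TEI namespace and the `<w>` element come from the TEI Guidelines; the sample fragment, the `lemma` attribute and its values, and the helper name `iter_words` are assumptions made for the example and may not match the actual IMP schema.

```python
import xml.etree.ElementTree as ET

# Standard TEI P5 namespace.
TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical sample: the real IMP markup may use different
# elements and attributes for its token annotation.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><p>
    <w lemma="ljubezen">lubesni</w>
    <w lemma="biti">je</w>
  </p></body></text>
</TEI>"""


def iter_words(xml_text):
    """Yield (surface form, attribute dict) for every TEI <w> element."""
    root = ET.fromstring(xml_text)
    for w in root.iter(f"{{{TEI_NS}}}w"):
        yield (w.text or "").strip(), dict(w.attrib)


words = list(iter_words(SAMPLE))
for form, attrs in words:
    print(form, attrs)
```

For a full corpus file, `ET.iterparse` would be the more memory-friendly choice, since it avoids loading the whole document at once.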


Tomaž Erjavec. 2012. The goo300k corpus of historical Slovene. Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), Istanbul.


Jozef Stefan Institute


The corpora are available in the source XML encoding and in a derived tabular (vertical) format, whose files are smaller and easier to process but contain less information, at