Produced by: Jožef Stefan Institute
The reference corpus of historical Slovene goo300k contains the text of 1,100 pages sampled from the IMP collection, with hand-validated linguistic annotation. The current version of the corpus is documented in its TEI header. The larger IMP corpus contains the complete IMP text collection and is automatically annotated. On nl.ijs.si the goo300k and IMP corpora can be searched (as part of the “Slovene reference corpora” group) with two concordancers:
- NoSketch Engine, the open source version of the commercial SketchEngine;
- our interface to the IMS CWB backend, for the more adventurous.
Each word token (e.g. “lubesni”) in the corpora is annotated with:
- modernised form (“ljubezni”);
- lemma (“ljubezen”);
- PoS tag (“Ncm”); the tagset is defined in the IMP morphosyntactic specifications.
- Tomaž Erjavec. 2012. The goo300k corpus of historical Slovene. Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), Istanbul.
The corpora are available in the source XML encoding and in a derived tabular (vertical) format; the vertical files are smaller and easier to process but contain less information. Both can be downloaded from http://nl.ijs.si/imp/index-en.html#corpus
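As a rough illustration of how the vertical files might be processed, the following Python sketch reads one token per line and skips structural tags. The column order (word, modernised form, lemma, PoS tag), the tab separator, and the names `Token` and `read_vertical` are assumptions made for this example and are not taken from the corpus documentation; the sample line is built from the example token above, so check the actual files before relying on this layout.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class Token:
    word: str        # original historical form, e.g. "lubesni"
    modernised: str  # modernised form, e.g. "ljubezni"
    lemma: str       # lemma, e.g. "ljubezen"
    tag: str         # PoS tag, e.g. "Ncm"


def read_vertical(lines: Iterable[str]) -> Iterator[Token]:
    """Yield tokens from vertical-format lines, skipping structural tags.

    Assumes four tab-separated columns in the order
    word, modernised form, lemma, tag (hypothetical layout).
    """
    for line in lines:
        line = line.rstrip("\n")
        # Blank lines and XML-like structural tags (<s>, </s>, ...) carry no token.
        if not line or (line.startswith("<") and "\t" not in line):
            continue
        fields = line.split("\t")
        if len(fields) < 4:
            continue  # line does not match the assumed four-column layout
        yield Token(*fields[:4])


# Hypothetical sample built from the example token in the text.
sample = [
    "<s>\n",
    "lubesni\tljubezni\tljubezen\tNcm\n",
    "</s>\n",
]
tokens = list(read_vertical(sample))
print(tokens[0].lemma)
```

With a real file, the list `sample` would be replaced by an open file handle, for instance `open(path, encoding="utf-8")`, where `path` points to a downloaded vertical file.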