IMPACT language resources for historical Slovene

Tomaz Erjavec

The Jožef Stefan Institute, a partner in the IMPACT project, was in charge of developing language resources for historical Slovene. With additional funding from the Google research award in the humanities “Developing Language Models of Historical Slovene”, it has produced three resources:

  1. A digital library / ground truth dataset, developed jointly with the National and University Library of Slovenia, also an IMPACT partner, and with the Scientific and Research Center of the Slovenian Academy of Sciences and Arts (partner in the Google project);
  2. A manually annotated corpus;
  3. A historical lexicon.

The digital library currently contains 160 units (16,000 pages, 3.7 million words), mostly from 1850-1918, comprising for the most part complete books and editions of one newspaper. The manually annotated corpus contains 1,000 pages (240,000 words) sampled from the digital library; each word has a manually validated modern-day equivalent word-form, lemma, and PoS tag. The lexicon contains 25,000 lemmas, with 52,000 modern-day word-forms and 73,000 historical equivalents, together with examples of usage from the library.

All three resources are encoded in TEI P5 and are now available for searching and download under the CC-BY licence.

More information is also available from the link below.
