The ground truth material for Polish consists of books published from 1617 to 1756 and of texts from the Digital Library of Polish and Poland-Related News Pamphlets dating from 1570 to 1728.
Prior to IMPACT, there were practically no historical corpora of Polish, which caused various problems from the very beginning. One of them was the lack of standards for representing old Polish texts in Unicode: several necessary characters and ligatures are provided neither by Unicode proper nor by the Medieval Unicode Font Initiative.
The primary resource was the Internet dictionary we shall refer to as the “Late Middle Polish dictionary”, its official name being “The dictionary of the Polish language of the sixteenth and the first half of the seventeenth century”.
The current lexicon consists of 9,909 lemmata, 24,977 word forms and 26,736 lemma/word form combinations.
In addition, a set of more than 100 rules for the historical spelling of Polish, developed for the IMPACT project, is now available.
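The three figures count different things, and the number of combinations can exceed the number of word forms only if some word forms are paired with more than one lemma. The following minimal sketch illustrates how such counts relate. The entries are made-up modern Polish examples, not data from the IMPACT lexicon, and the function name `lexicon_counts` is hypothetical.

```python
# Toy (lemma, word_form) pairs for illustration only.
# These are NOT taken from the IMPACT Polish lexicon.
pairs = [
    ("być", "jest"),
    ("być", "był"),
    ("mieć", "ma"),
    ("mój", "ma"),    # the same word form paired with a second lemma
    ("mieć", "miał"),
]


def lexicon_counts(pairs):
    """Return (lemmata, word forms, lemma/word form combinations)."""
    unique_pairs = set(pairs)                         # distinct combinations
    lemmata = {lemma for lemma, _ in unique_pairs}    # distinct lemmata
    word_forms = {form for _, form in unique_pairs}   # distinct word forms
    return len(lemmata), len(word_forms), len(unique_pairs)


counts = lexicon_counts(pairs)
print(counts)  # (3, 4, 5)
```

In this toy data the form "ma" belongs to two lemmata, so there are five combinations but only four distinct word forms.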
- IMPACT deliverable DEE3.13 Polish Lexicon Documentation (February 2012)
- Szafran, K. and M. Kresa. Glosa do leksykografii polskiej (New uses of historical dictionaries – in Polish), Glossa III lexicographic conference, 15–16 September 2011, Warsaw, Poland
- Bień, Janusz S. (2014) The IMPACT project Polish Ground-Truth texts as a DjVu corpus. Cognitive Studies | Études Cognitives (14). pp. 75-84. ISSN 2080-7147