The ground truth material for Polish consists of books published from 1617 to 1756 and the Digital Library of Polish and Poland-Related News Pamphlets from 1570 to 1728.
Prior to IMPACT, there were practically no historical corpora of Polish, which caused various problems from the very beginning. One of them was the lack of standards for representing old Polish texts in Unicode, as several necessary characters and ligatures are provided neither by Unicode proper nor by the Medieval Unicode Font Initiative.
The primary resource was the Internet dictionary we shall refer to as the “Late Middle Polish dictionary”, its official name being “The dictionary of the Polish language of the sixteenth and the first half of the seventeenth century”.
The current lexicon consists of 9,909 lemmata, 24,977 word forms and 26,736 lemma/word-form combinations.
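The number of lemma/word-form combinations exceeds the number of word forms because the same word form can belong to more than one lemma. The following is a minimal sketch of how the three figures relate, assuming the lexicon can be viewed as a list of (lemma, word form) pairs. The function name `lexicon_counts` and the toy entries are hypothetical illustrations in modern Polish, not data taken from the IMPACT lexicon.

```python
# Toy (lemma, word_form) pairs; illustrative only, not from the IMPACT lexicon.
pairs = [
    ("być", "jest"),
    ("być", "był"),
    ("mieć", "ma"),
    ("mój", "ma"),    # same word form as above, but a different lemma
    ("być", "jest"),  # repeated pair, counted only once
]


def lexicon_counts(pairs):
    """Return (number of lemmata, number of word forms, number of combinations)."""
    unique_pairs = set(pairs)                        # distinct lemma/word-form combinations
    lemmata = {lemma for lemma, _ in unique_pairs}   # distinct lemmata
    word_forms = {form for _, form in unique_pairs}  # distinct word forms
    return len(lemmata), len(word_forms), len(unique_pairs)


print(lexicon_counts(pairs))
```

In this toy example the ambiguous form "ma" is counted once as a word form but twice as a combination, which is the same effect that makes 26,736 larger than 24,977.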
In addition, a set of more than 100 rules for the historical spelling of Polish, developed for the IMPACT project, is now available.
The Polish lexicon is freely available, but distributing resources derived from the Late Middle Polish Dictionary requires the explicit permission of the Institute of Polish Language of the Polish Academy of Sciences. Likewise, distribution of resources derived from Morfeusz and the SAM analyser must comply with their respective licences.
For further information on licensing, please contact the University of Warsaw IMPACT Group.
A search engine for the IMPACT Polish Texts is available at http://nkjp.pl/poliqarp/.
The rules for the historical spelling of Polish developed for the IMPACT project are now available under the GNU GPL licence at https://github.com/jsbien/pol
Various scripts and data related to historical Polish are available from the IMPACT Centre of Competence GitHub page.