Two periods have been tackled in IMPACT in distinct ways. For early nineteenth-century material, ABBYY has trained fonts to enable recognition of texts set in Church Slavonic fonts without diacritics. No lexica have been built for this period.
For the late nineteenth century (1882-1903), books and newspapers have been ground-truthed, and OCR and IR lexica have been built.
The current lexicon has two parts: 28,857 lexical entries developed in LeXtractor, and a lexicon that can be extracted automatically from the manually validated tokens of the reference corpus. The historical lexicon extracted from the manually validated corpus tokens contains 26,148 word forms, 25,861 normalised forms, 21,115 modernised forms and 11,090 lemmata.
The lexicon is currently available as LeXtractor and TEI P5 XML and in the IMPACT database structure. In addition to the information in the table, it records how many times each lexical item occurs in the corpus, how many times it has been validated by hand, and a list of all corpus elements (page ids) in which the item is attested. Because these identifiers also contain the year of publication of each element, it is easy to estimate the period in which a particular lexical entry was in use.
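The period estimate described above can be illustrated with a minimal Python sketch. It is not the IMPACT implementation: the `LexiconEntry` class, its field names, and the page id format (a four-digit year embedded in the identifier, as in `bg_1885_news_0012`) are all assumptions made for illustration, and the actual identifiers may be structured differently.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# ASSUMPTION: the year of publication appears in the page id as a standalone
# four-digit number starting with 18 or 19. The real id format may differ.
YEAR_PATTERN = re.compile(r"(?<!\d)(1[89]\d{2})(?!\d)")


@dataclass
class LexiconEntry:
    """Hypothetical in-memory view of one lexical entry."""
    word_form: str
    lemma: str
    corpus_frequency: int = 0   # occurrences in the corpus
    validated_count: int = 0    # occurrences validated by hand
    page_ids: List[str] = field(default_factory=list)  # attesting corpus elements


def attested_period(entry: LexiconEntry) -> Optional[Tuple[int, int]]:
    """Return (earliest, latest) publication year found in the entry's page ids.

    Page ids without a recognisable year are skipped; None is returned
    when no id yields a year.
    """
    years = []
    for page_id in entry.page_ids:
        match = YEAR_PATTERN.search(page_id)
        if match:
            years.append(int(match.group(1)))
    if not years:
        return None
    return min(years), max(years)


# Placeholder entry; the strings and counts are invented, not corpus data.
example = LexiconEntry(
    word_form="form",
    lemma="lemma",
    corpus_frequency=3,
    validated_count=2,
    page_ids=["bg_1885_news_0012", "bg_1902_book_0044", "no_year_here"],
)
period = attested_period(example)
```

The result is only the span between the earliest and latest attestation in the corpus, so it gives a lower bound on the period of use rather than the full lifetime of the word.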
- IMPACT deliverable D-EE3.13 Bulgarian Lexicon documentation (February 2012)