Text processing
Toolbox for lexicon building
In addition to the problems presented to OCR by the age and structural complexity of historical documents, full-text recognition is also hindered by a lack of appropriate lexicographical data. Put simply, words become obsolete or change their spelling over time, and a standard OCR dictionary will only recognise the most modern variants.
But historical language is also a challenge for users searching these collections of digitised documents. At the Impact Centre of Competence, we also aim to improve searching in historical documents, so that users can search without knowing the details of the spelling and inflection of a historical language.
Therefore, we use computational lexica, which contain historical variants (orthographical variants, inflected forms) that are linked to a corresponding dictionary form in modern spelling (known as a “modern lemma”).
The Impact Centre provides guidelines and general tools for developing lexical data from historical source material, as well as tools for deploying the lexicon in enrichment (i.e. for retrieval).
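As a rough illustration of how such a lexicon can support retrieval, the sketch below expands a query in modern spelling to the historical variants linked to that lemma. It is a minimal sketch, not the Impact tooling itself: the entries are toy Dutch examples chosen for illustration, and the names `ENTRIES`, `build_index` and `expand_query` are hypothetical.

```python
from collections import defaultdict

# Toy entries (illustrative assumptions, not taken from an Impact lexicon):
# each pair is (historical word form, modern lemma).
ENTRIES = [
    ("mensch", "mens"),
    ("menschen", "mens"),
    ("mens", "mens"),
    ("visch", "vis"),
    ("visschen", "vis"),
]


def build_index(entries):
    """Group historical word forms under their modern lemma."""
    index = defaultdict(set)
    for form, lemma in entries:
        index[lemma].add(form)
    return index


def expand_query(lemma, index):
    """Return the lemma plus every historical variant linked to it."""
    return sorted(index.get(lemma, set()) | {lemma})


index = build_index(ENTRIES)
print(expand_query("mens", index))  # ['mens', 'mensch', 'menschen']
print(expand_query("boek", index))  # ['boek'] (no variants known)
```

A search engine could then match any of the returned forms, so the user only needs to type the modern spelling.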
Keep in mind the following things:
- Lexicon building for the digitisation and improved searching of historical documents only makes sense when the historical language differs substantially from the modern language. For Dutch, it already makes sense for documents from the 19th century. 19th-century German, on the other hand, does not differ greatly from modern German. In such cases, lexicon building brings the most benefit for much older language periods, for instance the 16th century.
- Lexicon building requires the assistance of a computational linguist and a historical linguist, or of one person who has mastered both skills.
Example Tool

Impact Tools – Lemmatization
Reducing historical word forms to one or several possible modern lemmas (lemmatization).
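The idea of lemmatization by lexicon lookup can be sketched as follows. This is only a minimal illustration under stated assumptions, not the behaviour of the Impact lemmatization tool: the lexicon is a hand-made toy dictionary whose entries have not been checked against a historical dictionary, and `HISTORICAL_LEXICON` and `lemmatize` are hypothetical names. A word form may map to more than one candidate lemma, and an unknown form yields no candidates.

```python
# Toy lexicon: historical word form -> set of candidate modern lemmas.
# Entries are illustrative assumptions, not taken from an Impact lexicon.
HISTORICAL_LEXICON = {
    "mensch": {"mens"},
    "menschen": {"mens"},
    "sach": {"zien"},
    "wort": {"worden", "woord"},  # ambiguous form: two candidate lemmas
}


def lemmatize(word_form, lexicon=HISTORICAL_LEXICON):
    """Return the candidate modern lemmas for a historical word form.

    Lookup is case-insensitive; an unknown form returns an empty list.
    """
    candidates = lexicon.get(word_form.lower(), set())
    return sorted(candidates)


for form in ["Menschen", "wort", "onbekend"]:
    print(form, "->", lemmatize(form))
```

Returning a list rather than a single lemma keeps ambiguous forms visible, so a later step (or a human annotator) can choose between the candidates in context.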
Publications
- IMPACT deliverable D-EE2.8 Development and Use of Computational Lexica for OCR And IR on Historical Documents. A Cross-Language Perspective (February 2012) – abstract
- IMPACT deliverable D-EE2.6 Lexicon Cookbook (December 2011)
- Tom Kenter, Tomaž Erjavec, Maja Žorga Dulmin, Darja Fišer. 2012. Lexicon construction and corpus annotation of historical language with the CoBaLT editor. In Proceedings of the EACL Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, Avignon, France, April. Association for Computational Linguistics.
- Depuydt, K. and J. de Does, Computational Tools and Lexica to Improve Access to Text. Article in: Fons Verborum. Feestbundel voor prof. dr. A.M.F.J. (Fons) Moerdijk, aangeboden door vrienden en collega's bij zijn afscheid van het INL. Onder redactie van E. Beijk, L. Colman e.a. Leiden/Amsterdam, 2009, p. 187-199.
- Depuydt, K. Overview of the Language Work in IMPACT. IMPACT Final Conference 2011, 24-25 October, London, UK
- Fitzgerald, N. Screencast: An Introduction to the IMPACT Toolbox for Languages