Tools for Lemmatization and Reverse Lemmatization
Produced by: Instituut voor Nederlandse Lexicologie (INL)
Abstract
IMPACT provides tools for:
- Reducing historical word forms to one or several possible modern lemmas (lemmatization).
- Expanding lemma lists with part-of-speech information into possible (“hypothetical”) full forms (reverse lemmatization).
The purpose of lemmatization in IMPACT is to improve retrieval in historical documents. Reverse lemmatization is used to create hypothetical lexicon content, mainly for lexicon building but possibly also for OCR and information retrieval. Both tools are Java command-line applications and process the data automatically.
Lemmatization is the process by which a word form is reduced to its base or canonical form (for instance, the lemma for ‘walking’ is the verb ‘walk’). The lemmatization process relies on:
- A “witnessed” historical lexicon from which possible lemmas are obtained by direct lookup.
- A reliable modern full-form lexicon, possibly augmented with hypothetical full forms obtained by applying reverse lemmatization to a historical lemma list in modern spelling.
- A set of weighted patterns used to match historical words that were not found in either of the two lexica above to word forms in the modern full-form lexicon.
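The three resources listed above suggest a lookup cascade, which can be sketched roughly as follows. This is a minimal illustration, not the IMPACT implementation: the lexicon entries, the pattern list, the weights and the function name `lemmatize` are all hypothetical, and the sketch applies only one pattern per candidate, whereas a real matcher would likely combine several.

```python
# Hypothetical toy data; real lexica and pattern sets are far larger.
HISTORICAL_LEXICON = {"dogges": {"dog"}}  # witnessed historical form -> lemmas
MODERN_LEXICON = {                        # modern full form -> lemmas
    "walking": {"walk"},
    "dogs": {"dog"},
    "love": {"love"},
}
# (historical substring, modern substring, weight); a higher weight is
# assumed here to mean a more plausible rewrite.
PATTERNS = [("vv", "w", 0.9), ("u", "v", 0.6), ("y", "i", 0.5)]


def lemmatize(word):
    """Return (lemma, score) candidates for a historical word form, best first."""
    word = word.lower()
    # Step 1: direct lookup in the witnessed historical lexicon.
    if word in HISTORICAL_LEXICON:
        return [(lemma, 1.0) for lemma in sorted(HISTORICAL_LEXICON[word])]
    # Step 2: direct lookup in the modern full-form lexicon.
    if word in MODERN_LEXICON:
        return [(lemma, 1.0) for lemma in sorted(MODERN_LEXICON[word])]
    # Step 3: apply each weighted pattern once, at every position where it
    # matches, and keep the rewrites that hit the modern lexicon.
    best = {}
    for old, new, weight in PATTERNS:
        start = word.find(old)
        while start != -1:
            candidate = word[:start] + new + word[start + len(old):]
            for lemma in MODERN_LEXICON.get(candidate, ()):
                best[lemma] = max(best.get(lemma, 0.0), weight)
            start = word.find(old, start + 1)
    # Highest score first; ties are broken alphabetically.
    return sorted(best.items(), key=lambda item: (-item[1], item[0]))
```

With this toy data, `lemmatize("dogges")` is resolved by the historical lexicon, `lemmatize("dogs")` by the modern lexicon, and `lemmatize("vvalking")` only by the pattern step, which rewrites it to the modern form "walking".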
Reverse lemmatization generates the possible inflectional paradigm from a canonical form. (For instance, for the verb ‘to be’ this yields ‘am, are, is, being, been, was, were’, and for the noun ‘dog’ it yields ‘dog, dogs’.)
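As a rough illustration of paradigm expansion, the sketch below generates hypothetical full forms from a lemma and a part-of-speech tag, then inverts them into a full-form lexicon. The suffix rules, tag names and function names are assumptions made for this example only; they cover regular English forms and would not produce the irregular paradigm of ‘to be’, which needs an exception list.

```python
# Hypothetical suffix rules per part of speech. Real paradigms need
# language-specific rules and exception lists for irregular words.
RULES = {
    "NOUN": ["", "s"],
    "VERB": ["", "s", "ing", "ed"],
}


def expand(lemma, pos):
    """Return hypothetical full forms for a lemma with a part-of-speech tag."""
    # Unknown tags fall back to the bare lemma.
    return [lemma + suffix for suffix in RULES.get(pos, [""])]


def build_full_form_lexicon(entries):
    """Map each generated full form back to the set of lemmas that produce it."""
    lexicon = {}
    for lemma, pos in entries:
        for form in expand(lemma, pos):
            lexicon.setdefault(form, set()).add(lemma)
    return lexicon
```

Here `expand("dog", "NOUN")` gives the two forms from the example above, and `build_full_form_lexicon` shows how such expansions could feed a full-form lexicon used for lookup.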
Publications
- IMPACT deliverable D-EE2.6 Lexicon Cookbook (December 2011)
Availability
The tool will be made available to the research community under the Apache Software License (ASL).