IMPACT provides tools for:
- Reducing historical word forms to one or several possible modern lemmas (lemmatization).
- Expanding lemma lists with part-of-speech information into possible (“hypothetical”) full forms (reverse lemmatization).
The purpose of lemmatization in IMPACT is to improve retrieval in historical documents. Reverse lemmatization is used to create hypothetical lexicon content, mainly for lexicon building but possibly also for OCR and information retrieval. Both tools are Java command-line applications that process the data automatically.
Lemmatization is the process by which a word form is reduced to its basic or canonical form (for instance, the lemma for ‘walking’ is the verb ‘walk’). The lemmatization process relies on:
1) A “witnessed” historical lexicon, from which possible lemmas are obtained by simple lookup.
2) A reliable modern full-form lexicon, possibly augmented with hypothetical full forms obtained by reverse lemmatization of a historical lemma list in modern spelling.
3) A set of weighted patterns used to match historical words that were not found in 1) or 2) to word forms in 2).
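The three resources above suggest a lookup cascade, which can be sketched as follows. This is only an illustrative Python sketch, not the IMPACT tool itself (which is a Java command-line application): the lexicon entries, the substring patterns, their weights, and the names `HISTORICAL_LEXICON`, `MODERN_LEXICON`, `PATTERNS`, `rewrite_candidates` and `lemmatize` are all hypothetical, and applying a single substring rewrite per candidate is a simplifying assumption.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy data; the real IMPACT lexica and patterns are far larger.
# 1) Witnessed historical lexicon: historical word form -> lemmas.
HISTORICAL_LEXICON: Dict[str, Set[str]] = {"doth": {"do"}}

# 2) Modern full-form lexicon: modern word form -> lemmas.
MODERN_LEXICON: Dict[str, Set[str]] = {
    "walking": {"walk"},
    "music": {"music"},
}

# 3) Weighted patterns: (historical substring, modern substring, weight).
PATTERNS: List[Tuple[str, str, float]] = [
    ("y", "i", 0.8),
    ("ck", "c", 0.7),
]


def rewrite_candidates(word: str) -> List[Tuple[str, float]]:
    """Apply each pattern once, at each position where it matches."""
    candidates = []
    for historical, modern, weight in PATTERNS:
        start = word.find(historical)
        while start != -1:
            end = start + len(historical)
            candidates.append((word[:start] + modern + word[end:], weight))
            start = word.find(historical, start + 1)
    return candidates


def lemmatize(word: str) -> List[Tuple[str, float]]:
    """Return (lemma, score) pairs, best score first."""
    word = word.lower()
    # Steps 1 and 2: direct lookup, historical lexicon first.
    for lexicon in (HISTORICAL_LEXICON, MODERN_LEXICON):
        if word in lexicon:
            return sorted((lemma, 1.0) for lemma in lexicon[word])
    # Step 3: rewrite the word with weighted patterns and look the
    # result up in the modern lexicon, keeping the best weight per lemma.
    scored: Dict[str, float] = {}
    for candidate, weight in rewrite_candidates(word):
        for lemma in MODERN_LEXICON.get(candidate, ()):
            scored[lemma] = max(scored.get(lemma, 0.0), weight)
    return sorted(scored.items(), key=lambda item: (-item[1], item[0]))
```

In this sketch a word found by direct lookup gets the score 1.0, while a word matched through a pattern inherits that pattern's weight, so `lemmatize("walkyng")` ranks ‘walk’ with the weight of the hypothetical `y` to `i` pattern. A word that matches nothing yields an empty list.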
Reverse lemmatization generates the possible inflectional paradigm from a canonical form. (For instance, for the verb ‘to be’ this would yield ‘am, are, is, being, been, was, were’, and for the noun ‘dog’ it would yield ‘dog, dogs’.)
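As a rough illustration of expanding a lemma with part-of-speech information into hypothetical full forms, consider the Python sketch below. The suffix rules, the part-of-speech tags, and the names `SUFFIX_RULES`, `IRREGULAR` and `expand` are assumptions made for this example and do not reflect the rule format of the IMPACT tool. The irregular entry for ‘be’ simply lists the forms given in the text.

```python
from typing import Dict, List, Tuple

# Hypothetical regular inflection rules: part of speech -> suffixes.
SUFFIX_RULES: Dict[str, List[str]] = {
    "NOUN": ["", "s"],
    "VERB": ["", "s", "ed", "ing"],
}

# Irregular paradigms cannot be derived by suffixing, so they are listed.
IRREGULAR: Dict[Tuple[str, str], List[str]] = {
    ("be", "VERB"): ["am", "are", "is", "being", "been", "was", "were"],
}


def expand(lemma: str, pos: str) -> List[str]:
    """Return hypothetical full forms for a lemma and its part of speech."""
    if (lemma, pos) in IRREGULAR:
        return list(IRREGULAR[(lemma, pos)])
    # Unknown parts of speech fall back to the bare lemma.
    return [lemma + suffix for suffix in SUFFIX_RULES.get(pos, [""])]
```

Here `expand("dog", "NOUN")` gives the two forms ‘dog’ and ‘dogs’. The generated forms are hypothetical in the sense used above: a rule may produce a form that never occurs, which is why such output is meant to supplement a lexicon rather than replace attested forms.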
- IMPACT deliverable D-EE2.6 Lexicon Cookbook (December 2011)