In addition to the problems presented to OCR by the age and structural complexity of historical documents, full-text recognition is also hindered by a lack of appropriate lexicographical data. Put simply, words become obsolete or change their spelling over time, and a standard OCR dictionary will only recognise the most modern variants.
Historical language is also a challenge for users searching these collections of digitised documents. At the Impact Centre of Competence, we therefore also aim to improve searching in historical documents, so that users can search without knowing the details of spelling and inflection of a historical language.
To this end, we use computational lexica, which contain historical variants (orthographical variants, inflected forms) that are each linked to a corresponding dictionary form in modern spelling (known as a “modern lemma”).
The Impact Centre provides guidelines and general tools for developing lexical data from historical source material, as well as tools for deploying the lexicon in enrichment (i.e. for retrieval).
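To make the idea of linking historical variants to a modern lemma more concrete, the following is a minimal sketch in Python of how such a lexicon could support retrieval. It is not the Impact Centre's tooling: the entries are a handful of illustrative Dutch word forms chosen for this example, and the names `ENTRIES`, `build_lexicon` and `expand_query` are hypothetical.

```python
from collections import defaultdict

# Illustrative entries only (not taken from an actual IMPACT lexicon):
# each pair is (word form as found in documents, modern lemma).
ENTRIES = [
    ("mensch", "mens"),      # historical spelling
    ("menschen", "mens"),    # historical spelling, plural
    ("mens", "mens"),        # modern spelling
    ("mensen", "mens"),      # modern spelling, plural
    ("visch", "vis"),
    ("visschen", "vis"),
    ("vis", "vis"),
]


def build_lexicon(entries):
    """Index the entries in both directions: form -> lemmas, lemma -> forms."""
    form_to_lemmas = defaultdict(set)
    lemma_to_forms = defaultdict(set)
    for form, lemma in entries:
        form_to_lemmas[form.lower()].add(lemma.lower())
        lemma_to_forms[lemma.lower()].add(form.lower())
    return form_to_lemmas, lemma_to_forms


def expand_query(term, form_to_lemmas, lemma_to_forms):
    """Return the search term plus every form sharing a modern lemma with it."""
    term = term.lower()
    expanded = {term}
    for lemma in form_to_lemmas.get(term, set()):
        expanded |= lemma_to_forms[lemma]
    return sorted(expanded)


FORM_TO_LEMMAS, LEMMA_TO_FORMS = build_lexicon(ENTRIES)

# A user who types the modern word also finds the historical spellings.
print(expand_query("mens", FORM_TO_LEMMAS, LEMMA_TO_FORMS))
# → ['mens', 'mensch', 'menschen', 'mensen']
```

In this sketch the expansion happens at query time; a search system could equally store the modern lemma alongside each word at indexing time. Which approach fits best depends on the retrieval system, and neither is implied by the text above.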
Keep the following in mind:
- Lexicon building for the digitisation and improved searching of historical documents only makes sense when the historical language differs substantially from the modern language. For Dutch, it already starts making sense for documents from the 19th century. On the other hand, 19th-century German does not differ greatly from modern German. In such cases, lexicon building is most beneficial for much older language periods, for instance the 16th century.
- Lexicon building requires the assistance of a computational linguist and a historical linguist, or of one person who has both skills.