Spelling Variation Tool


Abstract


The spelling of words in historical texts can differ widely from modern spelling. There are two general approaches to matching different spellings. First, it is possible to use rewrite rules that transform words from one spelling into another. For a historical dictionary that covers a large timespan, and in which variation is not limited to orthography, this approach is not satisfactory. In such cases the second approach, statistical matching, is often needed.
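To make the rewrite-rule approach concrete, here is a minimal sketch in Java. The class name `RewriteRules` and the three rules are hypothetical illustrations (loosely modelled on early modern Dutch spellings) and are not taken from the tool itself. The sketch only shows the general idea of applying ordered substitutions to a historical word.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RewriteRules {

    // Hypothetical rules, applied in insertion order: historical pattern -> modern pattern.
    private static final Map<String, String> RULES = new LinkedHashMap<>();
    static {
        RULES.put("ae", "aa");
        RULES.put("uy", "ui");
        RULES.put("ck", "k");
    }

    // Applies every rule to the whole word, one rule after the other.
    public static String normalize(String historical) {
        String result = historical;
        for (Map.Entry<String, String> rule : RULES.entrySet()) {
            result = result.replace(rule.getKey(), rule.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        for (String word : new String[] {"straet", "huys", "ick"}) {
            System.out.println(word + " -> " + normalize(word));
        }
    }
}
```

A fixed rule set like this works only as long as the variation is regular and orthographic, which is why it falls short for material spanning several centuries.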

A commonly used statistic describing the match between two strings is the Levenshtein distance. It is the minimum number of character operations (insertions, deletions, substitutions) necessary to change one string into the other. In our application, we calculate the Levenshtein distance between the two strings while also taking the length of the words into account. Words of length 4-7 are allowed a Levenshtein distance of 1, words of length 8-11 a distance of 2, etc.
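The length-dependent threshold described above can be sketched as follows. This is an illustration under stated assumptions, not the tool's actual implementation: the class and method names are hypothetical, the threshold is taken to be the word length divided by 4 (rounded down), which reproduces the ranges given in the text, and the length used is assumed to be that of the shorter word, since the text does not say which word's length counts.

```java
public class LengthAwareLevenshtein {

    // Standard dynamic-programming Levenshtein distance;
    // insertion, deletion and substitution each cost 1.
    public static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed threshold: lengths 0-3 -> 0, 4-7 -> 1, 8-11 -> 2, and so on.
    public static int maxDistance(int length) {
        return length / 4;
    }

    // Assumption: the shorter of the two words determines the threshold.
    public static boolean matches(String historical, String modern) {
        int length = Math.min(historical.length(), modern.length());
        return distance(historical, modern) <= maxDistance(length);
    }

    public static void main(String[] args) {
        System.out.println(matches("vrouwe", "vrouw")); // distance 1, threshold 1
        System.out.println(matches("zee", "see"));      // distance 1, threshold 0
    }
}
```

Under these assumptions, short words must match exactly, which limits spurious matches between unrelated three-letter words, while longer words tolerate proportionally more variation.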

The IMPACT Spelling Variation Tool deals with historical spelling variation. It provides functionality to estimate a model of spelling variation from example data, and to match a historical word, or a list of historical words, to a list of ‘modern’ words (or historical words in normalized, modern-like spelling).

The tool is a Java command-line application.

Publications

Availability

The tool will be made available to the research community under the Apache Software License (ASL).

OCR Post-correction and Enrichment