The output of an OCR engine usually contains different types of errors:
- Misspelled characters (substitutions).
- Spurious symbols (insertions).
- Lost or missing text (deletions).
For example, the following line of text
has been read by our OCR engine as “be hath exerciled the ftrength” instead of the original “he hath exerciſed the ſtrength”. Clearly, there are three misspelled characters in the produced reading: the h in “he” has been misinterpreted as a b, and the two long s characters (ſ) have been read as l and f, respectively.
However, the following line
was read as “For the Seat of Truth is not m theTongue, but in the Heart.” In this example, the word “in” has been read as “m”, which can be understood as an “i” that has disappeared followed by an “n” confused with an “m”. In addition, the space between “the” and “Tongue” has been lost.
Finally, the text segment
was read as “differing in this one thing from all others';”: a spurious apostrophe has appeared, apparently due to some dirt in the image.
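The three examples above can be checked mechanically by aligning the expected text with the OCR reading and counting the operations in a minimal-cost alignment. The following is only a sketch, assuming a plain Levenshtein alignment with unit costs; the function name `count_errors` is hypothetical, and the reference string for the third example is inferred from the description (the reading without the apostrophe).

```python
def count_errors(reference: str, ocr: str) -> dict:
    """Count the edit operations in one minimal-cost alignment (unit costs)."""
    n, m = len(reference), len(ocr)
    # dist[i][j] = minimum number of edits turning reference[:i] into ocr[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i  # delete every character of the prefix
    for j in range(1, m + 1):
        dist[0][j] = j  # insert every character of the prefix
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == ocr[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # match or substitution
                dist[i - 1][j] + 1,         # deletion (text lost by the OCR)
                dist[i][j - 1] + 1,         # insertion (spurious symbol)
            )

    # Walk back from the end of both strings, preferring the diagonal step.
    counts = {"substitutions": 0, "insertions": 0, "deletions": 0}
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if reference[i - 1] == ocr[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                counts["substitutions"] += cost
                i -= 1
                j -= 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            counts["deletions"] += 1
            i -= 1
        else:
            counts["insertions"] += 1
            j -= 1
    return counts


examples = [
    ("he hath exerciſed the ſtrength", "be hath exerciled the ftrength"),
    ("For the Seat of Truth is not in the Tongue, but in the Heart.",
     "For the Seat of Truth is not m theTongue, but in the Heart."),
    # Reference inferred from the text: the reading minus the apostrophe.
    ("differing in this one thing from all others;",
     "differing in this one thing from all others';"),
]
for reference, ocr in examples:
    print(count_errors(reference, ocr))
```

Under these assumptions, the first pair yields three substitutions, the second one substitution and two deletions, and the third a single insertion, which matches the informal analysis above.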
In some applications, for example, those dealing with text typed by humans, swapping two characters is a common error. However, swaps are not a natural mistake in OCR, and therefore we will not consider them as a specific type but rather as equivalent to the concurrence of two basic operations: a deletion followed by an insertion. For example, the word “unclear” can be obtained instead of “nuclear” if the leading “n” is removed and then a new “n” is inserted before “clear”.
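The decomposition of a swap into two basic operations can be illustrated with a short sketch. The helper names `delete_at`, `insert_at` and `levenshtein` are hypothetical and serve only this example; the distance function assumes unit costs and no dedicated swap operation.

```python
def delete_at(text: str, pos: int) -> str:
    """Remove the character at index pos."""
    return text[:pos] + text[pos + 1:]


def insert_at(text: str, pos: int, char: str) -> str:
    """Insert char so that it ends up at index pos."""
    return text[:pos] + char + text[pos:]


def levenshtein(source: str, target: str) -> int:
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            current.append(min(
                previous[j - 1] + (s_char != t_char),  # match or substitution
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
            ))
        previous = current
    return previous[-1]


word = "nuclear"
without_n = delete_at(word, 0)             # "uclear": the leading "n" is removed
swapped = insert_at(without_n, 1, "n")     # "unclear": a new "n" before "clear"
print(without_n, swapped)
print(levenshtein("nuclear", "unclear"))   # a swap costs two basic operations
```

Without a specific swap operation, the swap is charged as two basic errors, whether it is counted as a deletion plus an insertion or as two substitutions.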
In technical texts, these three types of basic errors are often called insertion, deletion and substitution errors.