How is the OCR error rate computed? In order to answer this question, the following issues will be discussed below:
- Minimal number of errors
- White space
- Case folding
- Character encoding
Minimal number of errors
Computing an error rate is not so simple once the number of mistakes grows. For instance, one could argue that the character error rate between “ernest” (reference) and “nester” (output) is 6, since no character in the reference appears in “nester” at the required position. However, an identical subsequence “nest” can be aligned in both words. A subsequence of a word is any string of characters that can be obtained by just removing some of the original ones (no reordering allowed, since the order of letters is essential for correct spelling). For example, “nest” results from removing the two initial characters of “ernest”, and “rnst” is the subsequence containing only the consonants of the original word.
Clearly, there is more similarity between this pair of words than the 100% CER conveys: it suffices to delete the two extra characters (“er”) at the beginning of the first word and then insert the two trailing characters (“er”) to obtain the second word. With these 4 operations, the CER becomes about 67% (4/6). In contrast, aligning only the two “er” subsequences would require 8 operations and therefore lead to a CER of 133%.
Therefore, the CER is computed from the minimum number of operations required to transform the reference text into the output (a number known as the Levenshtein distance; essentially, the larger it is, the more different the two texts are). However, finding an optimal sequence of basic operations is not trivial, at least for lengthy and noisy outputs. Fortunately, there are computational procedures which evaluate this minimal number automatically (see, for instance, the online demo here).
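The standard computational procedure is dynamic programming. As a minimal sketch (not the code of any particular tool), the Levenshtein distance between two strings can be computed row by row as follows:

```python
def levenshtein(ref: str, out: str) -> int:
    """Minimal number of character insertions, deletions and
    substitutions needed to transform ref into out."""
    # prev[j] holds the distance between the prefix of ref processed
    # so far and the first j characters of out.
    prev = list(range(len(out) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, o in enumerate(out, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != o)))  # substitution (free if equal)
        prev = curr
    return prev[-1]

print(levenshtein("ernest", "nester"))  # 4
```

The result for the example above is indeed 4 (delete the leading “er”, insert the trailing “er”), confirming that the naive position-by-position comparison overestimates the number of mistakes.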
As mentioned before, in some applications swaps are a natural source of mistakes. The generalization of the Levenshtein distance which includes swaps of contiguous characters as a fourth basic operation (in addition to insertions, substitutions and deletions) is known as the Damerau-Levenshtein distance.
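The common restricted variant of this distance (also called optimal string alignment) extends the previous recurrence with one extra case. A sketch:

```python
def damerau_levenshtein(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment):
    Levenshtein plus a swap of two adjacent characters as one operation."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = a[i - 1] != b[j - 1]
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            # Adjacent characters swapped: count a single operation.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

print(damerau_levenshtein("form", "from"))  # 1 (Levenshtein would give 2)
```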
A standard definition of the word error rate is:
WER = (iw + sw + dw) / nw
where nw is the number of words in the reference text, sw is the number of words substituted, dw the number of words deleted and iw the number of words inserted required to transform the output text into the target (among all possible transformations, the one minimizing the sum iw + sw + dw is chosen). Note that the number cw of correct words is cw = nw – dw – sw.
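The same dynamic programming used for characters works at the word level: the text is split into words and the edit distance is computed over the two word sequences. A minimal sketch, assuming a plain whitespace tokenizer:

```python
def edit_distance(ref, out):
    """Minimal insertions + substitutions + deletions between two
    sequences; works on lists of words as well as on strings."""
    prev = list(range(len(out) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, o in enumerate(out, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != o)))  # substitution
        prev = curr
    return prev[-1]

def wer(reference: str, output: str) -> float:
    """WER = (iw + sw + dw) / nw, with whitespace tokenization."""
    ref_words = reference.split()
    return edit_distance(ref_words, output.split()) / len(ref_words)

# One substitution (quick -> quack) plus one insertion (box): 2/4 = 0.5
print(wer("the quick brown fox", "the quack brown box fox"))  # 0.5
```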
The character error rate is defined in a similar way as:
CER = (i + s + d) / n
but using the total number n of characters and the minimal number of character insertions i, substitutions s and deletions d required to transform the reference text into the OCR output.
As seen before, the number of mistakes can be larger than the length of the reference text and lead to rates larger than 100% (for example, a ground-truth document with 100 characters and a lengthy OCR output which contains 120 wrong characters gives a 120% error rate and a negative accuracy). Sometimes the number of mistakes is therefore divided by the sum (i + s + d + c) of the number of edit operations (i + s + d) and the number c of correct symbols, a denominator which is never smaller than the numerator.
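Since c = n – s – d, the alternative denominator satisfies i + s + d + c = n + i, so it is enough to keep track of insertions during the computation. A sketch of both variants (not the code of any particular tool), storing pairs (edits, insertions) in the dynamic-programming cells:

```python
def cer(reference: str, output: str, normalized: bool = False) -> float:
    """Character error rate. With normalized=True the denominator is
    i + s + d + c = n + i, which keeps the rate at or below 100%."""
    n = len(reference)
    # Each cell holds (edits, insertions); tuples compare edits first,
    # so min() picks an optimal alignment.
    prev = [(j, j) for j in range(len(output) + 1)]
    for i, r in enumerate(reference, start=1):
        curr = [(i, 0)]
        for j, o in enumerate(output, start=1):
            e_del, k_del = prev[j]          # delete r
            e_ins, k_ins = curr[-1]         # insert o
            e_sub, k_sub = prev[j - 1]      # substitute (free if equal)
            curr.append(min((e_del + 1, k_del),
                            (e_ins + 1, k_ins + 1),
                            (e_sub + (r != o), k_sub)))
        prev = curr
    edits, ins = prev[-1]
    return edits / (n + ins) if normalized else edits / n

print(cer("ab", "abxyz"))                   # 1.5  (150%: 3 insertions over 2 chars)
print(cer("ab", "abxyz", normalized=True))  # 0.6  (3 edits / (3 edits + 2 correct))
```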
White space
Blank or white space plays an important role in any text, since it separates words (e.g., it is quite different to read “mangoes” instead of “man goes”), and it must therefore be considered while computing the character error rate. However, the length of a stretch of white space is not so relevant for understanding a text. Therefore, contiguous spaces are often considered equivalent to a single one: that is, the CER of “were wolf” (even with a double blank between both words) with respect to the reference text “werewolf” is 0.125.
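In practice this is achieved by normalizing the texts before the comparison, for instance as in the following sketch:

```python
import re

def normalize_spaces(text: str) -> str:
    """Collapse every run of whitespace into a single blank and strip
    leading/trailing space, so space length does not inflate the CER."""
    return re.sub(r"\s+", " ", text).strip()

print(normalize_spaces("were  wolf"))  # "were wolf"
```

After this normalization, “were  wolf” differs from “werewolf” by a single inserted blank, giving the 1/8 = 0.125 rate mentioned above.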
It is however worth noting that white space is not a character like the others: when the alignment between texts allows white space to be replaced with a printable character, it sometimes leads to weird results. For example, when comparing “bad man” and “batman”, the deletion of the d and the substitution of the white space with a t has the same Levenshtein cost (two operations) as the deletion of the white space plus the replacement of the d with a t (also two operations). However, the second option seems a more natural transformation. This behavior can be avoided if the substitution of white space with printable characters is not allowed.
In contrast, the computation of word error rates does not need to take care of white space, since the text is pre-processed by a word tokenizer and blanks only matter at that step. Some care must however be taken when defining a word, since the definition may slightly influence the word counts. For example, “I.B.M.” will very often be split into three different words by many tokenizers.
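The effect can be seen with a deliberately simple tokenizer (a sketch, not the rule any specific tool applies) that splits on every non-alphanumeric character:

```python
import re

def simple_tokenize(text: str) -> list:
    """Split on runs of non-alphanumeric characters; note how the
    abbreviation 'I.B.M.' becomes three separate tokens."""
    return [token for token in re.split(r"\W+", text) if token]

print(simple_tokenize("I.B.M. was founded in 1911"))
# ['I', 'B', 'M', 'was', 'founded', 'in', '1911']
```

A tokenizer that kept abbreviations together would report a different nw for the same text, and hence a slightly different WER.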
Case folding
Clearly, case matters in OCR: for example, “white house” and “White House” will probably refer to two different objects. Therefore, the CER when “white house” is obtained instead of “White House” is not 0 but rather 0.18 (2 substitutions out of 11 characters). However, in the context of information retrieval (whose objective is to find all attestations of a word or expression), case is often neglected (so that capitalized words in titles are equivalent to non-capitalized ones in normal paragraphs). This suggests that the WER should not count the substitution of an uppercase letter by the corresponding lowercase one as a mistake.
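For that retrieval-oriented setting, both texts can simply be case-folded before the word comparison, as in this sketch:

```python
def equal_ignoring_case(ref_word: str, out_word: str) -> bool:
    """Treat a pure case difference as a correct word rather than a
    substitution (str.casefold is the Unicode-aware lowercasing)."""
    return ref_word.casefold() == out_word.casefold()

print(equal_ignoring_case("White", "white"))  # True: not a mistake
print(equal_ignoring_case("White", "while"))  # False: a real substitution
```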
Character encoding
Unfortunately, there is a large number of alternative encodings for text files (such as ASCII, ISO-8859-9, UTF-8, Windows-1252), and the tool does its best to guess the format used for every input file.
XML files customarily include the declaration of the encoding in the very first line:
<?xml version="1.0" encoding="utf-8"?>
and this provides a method to detect automatically the character set which was employed to create the file.
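Reading that declaration is straightforward; the following sketch (a simplification, since real XML parsers also inspect byte-order marks) extracts the encoding pseudo-attribute from the raw bytes:

```python
import re

def xml_declared_encoding(raw: bytes):
    """Return the encoding named in an XML declaration, or None if
    the document does not declare one."""
    match = re.match(rb'<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']', raw)
    return match.group(1).decode("ascii") if match else None

print(xml_declared_encoding(b'<?xml version="1.0" encoding="utf-8"?>'))  # utf-8
print(xml_declared_encoding(b'<doc>no declaration</doc>'))               # None
```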
HTML files also often contain meta-information such as the following line:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
which indicates the character set used.
However, these metadata are not compulsory and may be missing. Furthermore, in the case of plain-text files (such as those whose names end with the extension .txt) there is no information about the encoding at all. In all cases where the encoding is not declared in the document, the software makes a guess which should work in most cases (at least for long enough files) but can occasionally lead to unexpected results.
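A very rough stand-in for such guessing (real detectors use byte-frequency statistics; this is only a sketch, not the tool's actual strategy) is to try a few common encodings in order:

```python
def guess_decode(raw: bytes) -> str:
    """Try common encodings in order of strictness; UTF-8 rejects most
    non-UTF-8 byte sequences, while Latin-1 accepts any byte, so the
    loop always succeeds by the last attempt."""
    for encoding in ("utf-8", "windows-1252", "latin-1"):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return raw.decode("utf-8", errors="replace")  # unreachable safety net

print(guess_decode(b"caf\xc3\xa9"))  # café (valid UTF-8)
print(guess_decode(b"caf\xe9"))      # café (falls back to Windows-1252)
```

The ordering matters: since Latin-1 and Windows-1252 never fail, a strict encoding like UTF-8 must be tried first, and short inputs can still be decoded "successfully" under the wrong encoding, which is exactly why guesses are more reliable for long files.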