Error rates and ground truth

The output of OCR engines often contains a number of mistakes, such as misspelled words or spurious characters.

To obtain a measure that is independent of the text size, the number of mistakes is usually normalized by the length of the expected content (the ground-truth text). The quotient between the number of mistakes and the text length is known as the error rate.

The error rate is usually calculated at two different levels:

  • Character error rate (CER).
  • Word error rate (WER).
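As an illustration of both levels, the following minimal sketch computes CER and WER. It assumes that mistakes are counted with the Levenshtein edit distance (insertions, deletions and substitutions), which is a common choice but is not prescribed by the text above, and that words are separated by whitespace. All function names are hypothetical.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    # previous[j] holds the distance between the reference prefix seen so far
    # and the first j items of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_item != hyp_item)
            insertion = current[j - 1] + 1   # extra item in the hypothesis
            deletion = previous[j] + 1       # reference item missing from the hypothesis
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def error_rate(reference, hypothesis):
    """Number of mistakes normalized by the length of the reference."""
    if len(reference) == 0:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def character_error_rate(ground_truth, ocr_output):
    # Compare the two texts character by character.
    return error_rate(ground_truth, ocr_output)


def word_error_rate(ground_truth, ocr_output):
    # Assumption: words are whitespace-separated tokens.
    return error_rate(ground_truth.split(), ocr_output.split())
```

For the ground truth "the quick brown fox" and the OCR output "the qu1ck brown f0x", this sketch gives a CER of 2/19 (about 0.11) and a WER of 2/4 (0.5): two wrong characters are enough to spoil half of the words.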

It is worth noting that word error rates are usually much higher than character error rates. For instance, a relatively small character error rate, such as 10%, means that about one half of six-letter words will contain at least one wrong character (more precisely, about 47% of six-letter words will be wrong under the naive assumption that errors are distributed uniformly and independently among characters). Nevertheless, a good correlation between the two measures can be expected (except, perhaps, for very high error rates) when both are used to compare different OCR engines.
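The 47% figure can be reproduced with a short calculation. Under the naive assumption stated above (every character is wrong independently with the same probability), a word is correct only if all of its characters are correct, so the probability that a word contains at least one error is 1 - (1 - p)^n for a character error rate p and a word length n. The function name below is hypothetical.

```python
def expected_word_error_rate(char_error_rate, word_length):
    """Probability that a word has at least one wrong character,
    assuming independent, uniformly distributed character errors."""
    return 1.0 - (1.0 - char_error_rate) ** word_length


# A 10% character error rate applied to six-letter words.
print(round(expected_word_error_rate(0.10, 6), 2))  # 0.47
```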

Experiments suggest that humans are relatively intolerant of errors in the following sense: when word accuracy is below 85%, manually correcting the output is less productive than typing the text entirely from scratch. It has also been claimed (Rose Holley, How Good Can It Get? Analysing and Improving OCR Accuracy in Large Scale Historic Newspaper Digitisation Programs, D-Lib Magazine 15, 2009) that good OCR accuracy means about 98-99% accurate, while an accuracy below 90% is poor.

Of course, error rates are relative to the expected output and, therefore, their estimation calls for a reference text. This reference is usually a digital document which has been manually produced to represent, as accurately as possible, the textual content of the original source. Such transcriptions are often called ground truth (a term borrowed from cartography), since they map the source to the true textual content. Although the true content seems to be a clear concept, its creation often involves some subtle decisions. For example, should ligatures be represented as a single Unicode character or as two independent characters? Should blurred and distorted letters be transcribed as the letters they were intended to be or as they actually look? Of course, there is no definitive answer to these types of questions, since only the source image contains all the information, and a meticulous creation of ground truth can demand an effort too high to be worthwhile.
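The ligature question has a direct effect on the computed error rates, because a ligature stored as one Unicode character does not match the same letters stored separately. As a hedged illustration (one possible convention, not a recommendation of this text), both the ground truth and the OCR output could be passed through Unicode compatibility normalization (NFKC) before comparison. Note that NFKC also folds other compatibility characters, for example the long s into a plain s, which may be undesirable for historical texts. The function name is hypothetical.

```python
import unicodedata


def expand_ligatures(text):
    """Apply Unicode NFKC normalization, which replaces compatibility
    ligatures such as U+FB01 (fi) by their component letters."""
    return unicodedata.normalize("NFKC", text)


single = "\ufb01n"  # LATIN SMALL LIGATURE FI followed by "n": two code points
separate = "fin"    # three independent characters

print(single == separate)                    # False: the raw strings differ
print(expand_ligatures(single) == separate)  # True after normalization
```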

An image with different ligatures (long s + i, long s + long s, long s + t) and a character et.

A fragment with ligatures and an inverted r which looks like an i (fifth line).
