The state of the art

Existing OCR engines provide high-quality results for modern printed text, where the output is often above 99% correct. There is, however, considerable room for improvement in the OCR of historical texts and manuscripts. For example, the best available technology, when applied to 16th-century books, often produces text in which most of the words are misrecognized. There are several reasons for this low performance:

  • Old fonts that modern OCR engines do not interpret adequately.
  • Poor quality of the original image (irregular spacing, stains, warped text, variable contrast, transparency).
  • Complex layouts that make it difficult to find the correct reading order.
  • Archaic vocabulary or orthography, which makes it harder to disambiguate alternative interpretations of the content.

This insufficient performance calls for objective quality measures that allow users to compare different engines and help the research and development community guide its efforts. The evaluation of OCR technology is discussed in the next section.
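To make the idea of an objective quality measure concrete, the sketch below computes a character accuracy from the edit distance between a reference transcription and an OCR output. This is only an illustrative assumption: the text here does not specify which measure is used, and the function names `edit_distance` and `character_accuracy`, as well as the sample OCR output, are hypothetical.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: the minimum number of character insertions,
    deletions and substitutions needed to turn hypothesis into reference."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # character missing from the OCR output
                current[j - 1] + 1,      # spurious character in the OCR output
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """Assumed measure: 1 - errors / reference length, floored at 0."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    errors = edit_distance(reference, hypothesis)
    return max(0.0, 1.0 - errors / len(reference))


# Reference transcription of a historical line, and a hypothetical OCR
# output in which the long s (ſ) has been misread as "f".
reference = "con los proximos, i cuan peſada i perezoſa para las co-"
ocr_output = "con los proximos, i cuan pefada i perezofa para las co-"

print(edit_distance(reference, ocr_output))
print(round(character_accuracy(reference, ocr_output), 3))
```

A character-level measure like this one penalizes every misread glyph equally. A word-level variant, obtained by applying the same distance to lists of tokens, would instead reflect how many words a reader or a search index would find wrong.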

Images with irregular spacing between characters pose a great challenge for identifying word boundaries (see, for example, the fifth line, whose correct transcription reads "con los proximos, i cuan peſada i perezoſa para las co-").
