Text recognition (OCR)


OCR (Optical Character Recognition) is the automatic transcription of text shown in an image into machine-readable text.

As an example, part of a page image and the corresponding OCR result are shown below:

Input (Image file):

[Image: OCR]

Output (machine-readable text):

[Image: OCR2]

The term OCR is usually applied to printed material, but it can also be used for typewritten or handwritten material. For IMPACT the main focus is on printed material, and to some extent typewritten material, from before 1850.

OCR is always created with an output format in mind. These formats include raw text (as above), archival and research-oriented XML formats such as ALTO and PAGE, and formats for wider dissemination and ease of public use, such as PDF or RTF.

The OCR process itself includes several steps, such as:

  • Binarisation – a step in which the image is converted into a bitonal (black and white) format, which helps the OCR engine to recognise characters.
  • Geometrical correction, e.g. dealing with page curl.
  • Segmentation – the automatic extraction of the main document components (text / graphic areas, text lines, words and characters or glyphs).
  • Pattern recognition – the actual “reading” of the text in the image (interpreting and determining characters).
  • Comparison of the recognition output to a lexicon. This may correct results that were recognised with low confidence or reinforce results that were already plausible, and it helps to resolve ambiguities.
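
Two of the steps above, binarisation and lexicon comparison, can be illustrated with a minimal Python sketch. It is not how any particular OCR engine works: it assumes the image is already a small grid of greyscale values (0 = black, 255 = white), it picks the threshold with Otsu's method (one common choice among many), and it uses plain string similarity for the lexicon lookup. The function names and the similarity cutoff of 0.8 are hypothetical and chosen only for illustration.

```python
from difflib import get_close_matches


def otsu_threshold(pixels):
    """Return the grey level (0-255) that best separates dark from light pixels.

    Otsu's method: try every threshold and keep the one that maximises
    the between-class variance of the two resulting groups.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark = 0    # number of pixels with value <= t
    sum_dark = 0  # sum of the values of those pixels
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        w_light = total - w_dark
        if w_dark == 0:
            continue  # no dark pixels yet, nothing to separate
        if w_light == 0:
            break     # every pixel is already in the dark class
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        between_var = w_dark * w_light * (mean_dark - mean_light) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarise(image):
    """Convert a greyscale image (list of rows) to bitonal: 1 = ink, 0 = paper."""
    flat = [p for row in image for p in row]
    t = otsu_threshold(flat)
    return [[1 if p <= t else 0 for p in row] for row in image]


def correct_with_lexicon(word, lexicon, cutoff=0.8):
    """Replace a recognised word by its closest lexicon entry, if one is similar enough.

    The cutoff is an assumed value; words with no close match are left unchanged.
    """
    if word in lexicon:
        return word
    matches = get_close_matches(word, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else word
```

For instance, binarise([[250, 245, 30, 240], [248, 20, 25, 252]]) marks the three dark pixels as ink, and correct_with_lexicon("rec0gnition", ["recognition", "character"]) returns the lexicon entry in place of the misread word. A word with no sufficiently similar entry is returned unchanged, which avoids "correcting" names and rare historical spellings that the lexicon does not contain.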