OCR engines read the text contained in one image (usually a single page) and create a new file with the textual content.
Humans can do this very easily in most cases (at least with printed text), but computers cannot always provide good results. Why? In addition to the reasons presented in section 1.3 (which also affect most human readers), there are a number of reasons specific to OCR machinery:
- Computers lack the general knowledge that allows humans to correctly interpret some difficult texts. For example, basic syntactic knowledge is enough to rule out that a stain is a period (as in “She came to. Paris.”), and semantic considerations can resolve the ambiguity in a word.
- OCR software has to operate with reasonable accuracy in a wide range of circumstances (scripts, languages, fonts). Although adapting OCR software to a particular environment could improve results, this often requires advanced skills, which can be expensive to hire.
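
The kind of syntactic check mentioned in the first point can be illustrated with a toy post-processing step. The sketch below is not how any particular OCR engine works: the function name `drop_suspicious_periods` and the word list are hypothetical, and the rule (a period attached to a word that rarely ends an English sentence is probably a stain) is a deliberately crude assumption.

```python
# Hypothetical, incomplete list of English words that rarely end a sentence.
# Note that some of them can end one (e.g. "She came to."), so this is only a heuristic.
UNLIKELY_SENTENCE_ENDINGS = {"a", "an", "the", "to", "of", "in", "at", "from", "with"}


def drop_suspicious_periods(text: str) -> str:
    """Remove periods that follow a word unlikely to end a sentence.

    The final token is left untouched, since a period there is expected.
    """
    tokens = text.split()
    cleaned = []
    for i, token in enumerate(tokens):
        is_last = i == len(tokens) - 1
        word = token[:-1].lower()
        if not is_last and token.endswith(".") and word in UNLIKELY_SENTENCE_ENDINGS:
            token = token[:-1]  # treat the period as a stain and drop it
        cleaned.append(token)
    return " ".join(cleaned)


print(drop_suspicious_periods("She came to. Paris."))  # She came to Paris.
```

A human reader applies this sort of knowledge without effort and with far better judgment. A fixed word list like the one above will both miss real stains and remove some legitimate periods.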