Digitisation is usually understood as the process of creating digital replicas of traditional objects such as books, photographs or artifacts. In particular, the digitisation of textual content usually comprises four steps:
- Image capture (the creation of digital replicas, for instance, by scanning a printed map to obtain a digital picture in a graphics format such as TIFF or JPG).
- Transcription (the conversion of raw images into automatically readable and searchable objects).
- Dissemination (the exploitation of the digital content created, for example, as a digital library).
- Long-term preservation (the measures intended to guarantee that the effort of creating digital content leads to collections which remain accessible in the future).
Optical character recognition (OCR, or automatic transcription) refers to the second step, the automatic transformation of printed text into digital documents. More precisely, OCR deals with the transformation of a digital image, for example, a digital picture of a book page in JPG format (like this sample), into a digital document whose text can be easily processed by a computer, such as a text file or searchable PDF (like this one).
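As a rough illustration of this image-to-text step, the following Python sketch writes the text recognised in a page image to a plain text file. It is not tied to any particular digitisation workflow: the choice of the Tesseract engine (through the pytesseract and Pillow packages, which must be installed separately) is an assumption, and the function names `ocr_with_tesseract` and `transcribe` are hypothetical.

```python
from pathlib import Path


def ocr_with_tesseract(image_path):
    """Return the text recognised in an image.

    Assumption: the Tesseract engine plus the pytesseract and Pillow
    packages are installed; any other OCR engine could be used instead.
    """
    # Imported lazily so that transcribe() can be used with another engine.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image)


def transcribe(image_path, engine=ocr_with_tesseract):
    """Save the recognised text next to the image as a .txt file.

    `engine` is any callable that takes an image path and returns a string.
    Returns the path of the text file that was written.
    """
    image_path = Path(image_path)
    text = engine(image_path)
    text_path = image_path.with_suffix(".txt")
    text_path.write_text(text, encoding="utf-8")
    return text_path
```

Calling `transcribe("page.jpg")` on a hypothetical scan named `page.jpg` would produce `page.txt` in the same folder. Producing a searchable PDF instead of plain text depends on the engine used and is left out of this sketch.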
OCR enriches the digitised content because it makes the text indexable and facilitates its transformation into other formats for dissemination or preservation.
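To make "indexable" concrete, here is a minimal Python sketch of a word index built over OCR output. The page identifiers and sample texts are invented for illustration, and `build_index` is a hypothetical helper, not part of any OCR tool; real digital libraries typically rely on a dedicated search engine.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each lowercased word to the sorted list of page ids containing it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(page_id)
    return {word: sorted(ids) for word, ids in index.items()}


# Hypothetical OCR output for two scanned pages.
pages = {
    "page-001": "Optical character recognition of printed text",
    "page-002": "Printed maps and photographs",
}
index = build_index(pages)
print(index["printed"])  # ['page-001', 'page-002']
print(index["maps"])     # ['page-002']
```

Without the transcription step, the pages would remain images and a query such as "maps" could not be matched against their content.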