IMPACT provides an overview of the digitisation workflow which includes the following steps:
Image enhancement tools manipulate scanned images in order to improve the recognition results of OCR engines.
The various defects that can manifest themselves in document images are grouped into three broad categories of conditions that can be improved or eliminated in order to enhance the results obtained from scanned documents.
Image enhancement tools are:
- Binarisation and Colour Reduction
- Border Detection and Removal
- Geometric Correction: Page Curl & Arbitrary Warping
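As an illustration of the first tool category, binarisation can be sketched with Otsu's global thresholding, a classical textbook method. This is a minimal sketch and not necessarily the algorithm used by the IMPACT tools; the function names `otsu_threshold` and `binarise` and the list-of-rows image representation are assumptions made for this example.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that maximises between-class variance."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for level in range(256):
        weight_dark += histogram[level]
        if weight_dark == 0:
            continue  # no pixels at or below this level yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break  # every pixel is already in the dark class
        sum_dark += level * histogram[level]
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def binarise(image):
    """Convert a greyscale image (rows of 0-255 ints) to 1 = ink, 0 = paper."""
    threshold = otsu_threshold([value for row in image for value in row])
    return [[1 if value <= threshold else 0 for value in row] for row in image]


# Hypothetical 2x3 scan: dark values are ink, light values are paper.
scan = [[20, 30, 200],
        [210, 25, 220]]
bitonal = binarise(scan)
```

A single global threshold tends to struggle with uneven illumination or stains, which is one reason historical material often needs more adaptive methods.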
Segmentation is a major function in an OCR system. During this step, the main document components (text / graphic areas, text lines, words and characters or glyphs) are automatically extracted.
Traditionally, segmenting historical machine-printed documents has been tackled by the use of techniques that are mainly designed for contemporary documents.
As a result, several problems inherent in historical documents seriously affect the segmentation and, consequently, the recognition accuracy of OCR: the generally low quality of the original volume; complex, dense and irregular layouts; and artefacts not completely corrected during pre-processing (noise between characters, ink diffusion and text skew). Furthermore, volume-specific rules are usually used for segmenting historical machine-printed documents. In the context of a mass digitisation workflow, this is unworkable and has necessitated the development of new approaches.
IMPACT introduces novel hierarchical segmentation models that allow the discrete problems of text block, text line, word and character segmentation to be addressed separately while at the same time allowing for interplay between all levels.
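To make one level of this hierarchy concrete, the sketch below splits a binarised page into text lines using a horizontal projection profile. This is a classical baseline technique shown for illustration only, not the IMPACT segmentation models; the function name `segment_text_lines` and the `min_ink` parameter are assumptions made for this example.

```python
def segment_text_lines(binary_image, min_ink=1):
    """Return (top, bottom) row ranges, bottom exclusive, one per text line.

    binary_image is a list of rows where 1 = ink and 0 = paper.
    A row belongs to a text line when it holds at least min_ink ink pixels.
    """
    lines = []
    start = None
    for row_index, row in enumerate(binary_image):
        has_ink = sum(row) >= min_ink
        if has_ink and start is None:
            start = row_index  # a new text line begins
        elif not has_ink and start is not None:
            lines.append((start, row_index))  # a blank row closes the line
            start = None
    if start is not None:
        lines.append((start, len(binary_image)))  # line runs to the page edge
    return lines


# Hypothetical page: two text lines separated by one blank row.
page = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 0],
    [1, 0, 0],
    [0, 0, 0],
]
line_ranges = segment_text_lines(page)  # [(1, 3), (4, 5)]
```

The same profile idea can be applied to columns within a line to find word gaps. It breaks down on skewed or densely set pages, where blank rows between lines disappear.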
- IMPACT deliverable D-TR2: Segmentation and Classification Toolkit (July 2011)
OCR (Optical Character Recognition) is defined as the automatic transcription of the text shown in an image into machine-readable text.
The term OCR is usually applied to printed material, but it can also be used for typewritten or handwritten material. For the IMPACT project, the main focus was on printed material, and partially on typewritten material, from before 1850.
OCR is always created with an output format in mind. These formats include raw text, archival and research-oriented XML formats such as ALTO and PAGE, and formats for wider dissemination and ease of public use, such as PDF or RTF.
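As a rough illustration of how an XML output format relates to raw text, the sketch below reads a small ALTO fragment and reduces it to plain lines. The sample document is invented for this example, and the ALTO version 3 namespace is an assumption; real files may declare a different schema version, in which case `ALTO_NS` would need adjusting.

```python
import xml.etree.ElementTree as ET

# Assumption: the file uses the ALTO v3 namespace.
ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v3#"}

# Invented sample; WC is the per-word confidence reported by the OCR engine.
SAMPLE_ALTO = """
<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout>
    <Page ID="p1">
      <PrintSpace>
        <TextBlock ID="b1">
          <TextLine ID="l1">
            <String CONTENT="Historical" WC="0.93"/>
            <SP/>
            <String CONTENT="newspaper" WC="0.71"/>
          </TextLine>
          <TextLine ID="l2">
            <String CONTENT="text" WC="0.88"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def alto_to_text(alto_xml):
    """Flatten ALTO into raw text: one output line per TextLine element."""
    root = ET.fromstring(alto_xml.strip())
    lines = []
    for text_line in root.iterfind(".//alto:TextLine", ALTO_NS):
        words = [
            string.get("CONTENT", "")
            for string in text_line.iterfind("alto:String", ALTO_NS)
        ]
        lines.append(" ".join(words))
    return "\n".join(lines)


raw_text = alto_to_text(SAMPLE_ALTO)
```

The flattening discards coordinates and confidence values, which is the trade-off between raw text and the richer archival formats.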
The OCR process itself includes several steps, such as:
- Binarisation – a step in which the image is converted into a bitonal (black and white) format, which helps the OCR engine to recognise characters.
- Geometrical correction, e.g. dealing with page curl.
- Segmentation – the automatic extraction of the main document components (text / graphic areas, text lines, words and characters or glyphs).
- Pattern recognition – the real “reading” of a text on the image (interpreting and determining characters).
- Comparison of the recognition output to a lexicon. This may correct a low-reliability recognition result or reinforce a correct one, and helps to resolve ambiguities.
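The last step in the list, comparison against a lexicon, can be sketched with the standard library's `difflib`. This is a minimal illustration under stated assumptions: the tiny lexicon, the function name `correct_token` and the similarity cutoff of 0.8 are all hypothetical, and a real OCR engine would also weigh its own recognition confidence.

```python
import difflib

# Hypothetical lexicon of known word forms (lower case).
LEXICON = {"library", "printed", "volume"}


def correct_token(token, lexicon, cutoff=0.8):
    """Return the token itself if known, else the closest lexicon entry.

    Tokens with no entry above the similarity cutoff are left unchanged,
    so unknown but valid words (names, rare forms) are not overwritten.
    """
    if token.lower() in lexicon:
        return token
    candidates = difflib.get_close_matches(
        token.lower(), lexicon, n=1, cutoff=cutoff
    )
    return candidates[0] if candidates else token


fixed = correct_token("1ibrary", LEXICON)     # digit 1 misread for letter l
unchanged = correct_token("xyzzy", LEXICON)   # nothing similar in the lexicon
```

A lexicon limited to modern spellings would wrongly "correct" valid historical forms, which is why the choice of lexicon matters for older material.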
OCR Post-correction and Enrichment
OCR produces its best results from well-printed, modern documents. Historical documents, however, contain a range of defects that can reduce the accuracy of recognition: poor paper quality, poor typesetting, damage or degradation of the original paper source, and text skew or warping due to age or humidity. In addition, content-holding institutions tend to have legacy data: text-based digitised material that was not originally created with OCR in mind.
This sort of material produces unsatisfactory OCR accuracy and renders digital material only partially discoverable and usable at best. The IMPACT project therefore created a number of tools and modules that allow institutions and their users to correct and validate OCR text either before publication or afterwards (by means of crowdsourcing).
These tools include a collaborative correction system, a document layout analysis module that can decode the structural information of text volumes, an application for interactive post-correction of OCR documents, and tools that can identify and extract named entities (people, locations and organisations) from a digital text file.
- IMPACT Pilot report on Postcorrection Tools (June 2012) by KB National Library of the Netherlands
In addition to the problems presented to OCR by the age and structural complexity of historical documents, full-text recognition is also hindered by a lack of appropriate lexicographical data. Put simply, words become obsolete or change their spelling over time, and a standard OCR dictionary will only recognise the most modern variants.
Historical language is also a challenge for users searching these collections of digitised documents. At the IMPACT Centre of Competence, we therefore also aim to improve searching in historical documents, so that users need no knowledge of the spelling and inflection details of a historical language.
To this end, we use computational lexica, which contain historical variants (orthographical variants, inflected forms) that are linked to a corresponding dictionary form in modern spelling (known as a “modern lemma”).
The IMPACT Centre provides guidelines and general tools for developing lexical data from historical source material, as well as tools for deploying the lexicon in enrichment (i.e. for retrieval).
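The link between historical variants and a modern lemma can be pictured as a two-way lookup used for query expansion. The sketch below is a simplified illustration, not the IMPACT lexicon format or tooling; the sample entries (older Dutch spellings such as "mensch" for modern "mens") and the names `build_lexicon` and `expand_query` are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical entries: (attested word form, modern lemma).
SAMPLE_ENTRIES = [
    ("mensch", "mens"),
    ("menschen", "mens"),
    ("mens", "mens"),
    ("mensen", "mens"),
    ("visch", "vis"),
    ("vis", "vis"),
]


def build_lexicon(entries):
    """Index the entries both ways: form -> lemma and lemma -> forms."""
    form_to_lemma = {}
    lemma_to_forms = defaultdict(set)
    for form, lemma in entries:
        form_to_lemma[form] = lemma
        lemma_to_forms[lemma].add(form)
    return form_to_lemma, lemma_to_forms


def expand_query(term, form_to_lemma, lemma_to_forms):
    """Return every known form sharing the term's lemma, sorted.

    A term that is not in the lexicon is returned on its own.
    """
    lemma = form_to_lemma.get(term)
    if lemma is None:
        return [term]
    return sorted(lemma_to_forms[lemma])


form_to_lemma, lemma_to_forms = build_lexicon(SAMPLE_ENTRIES)
search_terms = expand_query("mens", form_to_lemma, lemma_to_forms)
```

A user who types the modern word then also retrieves documents containing the historical spellings and inflected forms.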
Keep in mind the following things:
- Lexicon building for the digitisation and improved searching of historical documents only makes sense when the historical language differs substantially from the modern language. For Dutch, it already makes sense for documents from the 19th century. 19th-century German, on the other hand, does not differ greatly from modern German; in such cases, lexicon building brings the most benefit for much older language periods, for instance the 16th century.
- Lexicon building requires the assistance of a computational linguist and a historical linguist, or of one person with both skills.
- IMPACT deliverable D-EE2.8 Development and Use of Computational Lexica for OCR And IR on Historical Documents. A Cross-Language Perspective (February 2012)
- IMPACT deliverable D-EE2.6 Lexicon Cookbook (December 2011)
- Tom Kenter, Tomaž Erjavec, Maja Žorga Dulmin, Darja Fišer. 2012. Lexicon construction and corpus annotation of historical language with the CoBaLT editor. In Proceedings of the EACL Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, Avignon, France, April. Association for Computational Linguistics.
- Depuydt, K. and J. de Does. 2009. Computational Tools and Lexica to Improve Access to Text. In: Fons Verborum. Feestbundel voor prof. dr. A.M.F.J. (Fons) Moerdijk, aangeboden door vrienden en collega's bij zijn afscheid van het INL. Onder redactie van E. Beijk, L. Colman e.a. Leiden/Amsterdam, p. 187-199.
The IMPACT project developed a series of software modules that evaluate the performance of each stage in an OCR production workflow, from image enhancement and segmentation through to the quality of the OCR output itself.
In addition, IMPACT created a Metrics Toolkit: a web application for the statistical evaluation of the outputs of different digitisation workflows. The tool compares the OCR result from specific workflows against validated text and layout information (“ground truth”) and demonstrates the enhancement achieved by different configurations of tools.
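Comparing OCR output against ground truth is commonly expressed as a character error rate based on edit distance. The sketch below shows that general idea under stated assumptions; it does not reproduce the measures implemented in the Metrics Toolkit, and the function names and sample strings are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(ocr_text, ground_truth):
    """Edit distance to the ground truth, divided by ground-truth length."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ocr_text, ground_truth) / len(ground_truth)


# Hypothetical outputs of two workflow configurations for the same line.
ground_truth = "the cat"
baseline_cer = character_error_rate("tbe c4t", ground_truth)  # 2 errors
enhanced_cer = character_error_rate("the cat", ground_truth)  # 0 errors
improvement = baseline_cer - enhanced_cer
```

Scoring each configuration against the same ground truth in this way makes the gain from an added enhancement step directly comparable across workflows.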