Post-processing and language technology in OCR


Jesse De Does from INL follows Katrien by introducing a post-processing Text and Error Profiling tool that looks similar to the IBM tool demonstrated by Niall earlier. This tool differs in its use of ‘text profiling’, whereby language and text are analysed on the basis of frequency and logic. It also allows for batch correction of every occurrence of the same word.

The post-correction module also contains a memory bank of expected text recognition errors (‘f’ for ‘s’, for example) and allows batch correction at character level across a volume. This is known as ‘error profiling’. Small tests so far have shown good productivity results.
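To make the idea of error profiling concrete, here is a minimal sketch in Python. It is not the INL tool's code: the `ERROR_PROFILE` table, the tiny `LEXICON` and both function names are assumptions made up for illustration. It undoes a known confusion (‘f’ read for ‘s’) wherever that turns an unknown token into exactly one known word, and applies the fix to every occurrence in the volume.

```python
from itertools import product

# Hypothetical error profile: OCR output character -> characters it may stand for.
ERROR_PROFILE = {"f": ["s"]}

# Tiny illustrative lexicon; a real one would be far larger.
LEXICON = {"fast", "last", "song", "sense", "first", "false"}


def candidates(token, profile=ERROR_PROFILE):
    """Yield every spelling obtained by undoing profiled confusions in token."""
    options = [[ch] + profile.get(ch, []) for ch in token]
    for combo in product(*options):
        yield "".join(combo)


def batch_correct(tokens, lexicon=LEXICON):
    """Correct out-of-lexicon tokens that have exactly one in-lexicon candidate.

    Returns the corrected token list and the mapping that was applied,
    so every occurrence of the same misreading is fixed in one pass.
    """
    corrections = {}
    for token in set(tokens):
        if token in lexicon:
            continue  # already a known word, leave it alone
        hits = {c for c in candidates(token) if c in lexicon}
        if len(hits) == 1:  # skip ambiguous cases rather than guess
            corrections[token] = hits.pop()
    return [corrections.get(t, t) for t in tokens], corrections


corrected, applied = batch_correct(["fong", "laft", "fast", "fenfe", "fong"])
```

Ambiguous tokens are deliberately left untouched here; in an interactive tool those would presumably be shown to a human corrector.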

Jesse explains the technology that underlies the tool: essentially a hypothetical lexicon, which analyses patterns in a language and suggests words on the basis of likelihood. The same process can be run on a single volume, producing patterns of words based on their frequency within that volume. The tool essentially brings together modern words, historical words and OCR recognitions and tries to establish the correct variant among the three.
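A rough sketch of how such likelihood-based suggestions might work is given below. It is an assumption-laden illustration, not the method Jesse presented: the `MODERN` and `HISTORICAL` word sets, the `suggest` function, the similarity threshold and the scoring formula are all invented here. Each known word is scored by its string similarity to the OCR token, weighted by how often that word occurs in the volume.

```python
import math
from collections import Counter
from difflib import SequenceMatcher

# Hypothetical word lists; real lexica would hold many thousands of entries.
MODERN = {"letter", "better", "water"}
HISTORICAL = {"lettre", "bettre"}


def suggest(ocr_token, volume_counts, min_similarity=0.7):
    """Return the most likely known word for ocr_token, or None.

    Score = string similarity * log(2 + frequency in this volume), so a
    word that is both close to the OCR output and common in the volume wins.
    """
    best, best_score = None, 0.0
    for word in sorted(MODERN | HISTORICAL):
        similarity = SequenceMatcher(None, ocr_token, word).ratio()
        if similarity < min_similarity:
            continue  # too different to be a plausible reading
        score = similarity * math.log(2 + volume_counts[word])
        if score > best_score:
            best, best_score = word, score
    return best


# Word frequencies within a single (made-up) volume.
counts = Counter(["lettre"] * 5 + ["letter"] + ["water"] * 3)
suggestion = suggest("lcttre", counts)
```

Because the historical spelling is kept as a candidate in its own right, the sketch can prefer it over the modern form when the volume itself uses it.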

Jesse’s presentation is here:


Jesse moves on to talk about the Functional Extension Parser (FEP), which has been developed by the University of Innsbruck to automatically extract structural information from a digital document. The FEP builds on the text segmentation that OCR already provides, so that the table of contents, chapter headings, page numbers, footnotes, etc. can be automatically extracted. Even where full text is not offered, therefore, the document will still be navigable and searchable by structure.

By identifying and focussing on the print-space of a volume (as opposed to blank margins), the FEP also allows for consistent display in print-on-demand or screen versions of the digital book. The FEP can output in most of the main bibliographic metadata standards (XML, ALTO, METS, TEI, etc.), with export to an image version as well.

Early testing of the tool has produced very promising results, with a number of print-on-demand books published based on FEP outputs as part of the experimental process. Jesse shows an example from the British Library’s IMPACT dataset, in which every structural object within the volume is clearly annotated and tagged with its function. Very few errors were reported.

The presentation is here: