OCR Post-correction and Enrichment

OCR produces its best results from well-printed, modern documents. Historical documents, however, exhibit a range of problems that reduce recognition accuracy: poor paper quality, poor typesetting, damage to or degradation of the original paper source, and text skew or warping caused by age or humidity. In addition, content-holding institutions tend to have legacy data: text-based digitised material that was not originally created with OCR in mind.

This sort of material yields unsatisfactory OCR accuracy, leaving the digitised content only partially discoverable and usable at best. The IMPACT project therefore created a number of tools and modules that allow institutions and their users to correct and validate OCR text, either before publication or afterwards (by means of crowdsourcing).

These tools include a collaborative correction system, a document layout analysis module that can decode the structural information of text volumes, an application for interactive post-correction of OCR documents, and tools that can identify and extract named entities (people, locations and organisations) from a digital text file.