CIS-LMU Post Correction Tool (PoCoTo)


Scenario


Interactive post-correction of OCRed documents. Using the information obtained by the Text and Error Profiler, the whole correction process adapts to the document being processed. As a result, large numbers of systematic errors can usually be corrected with just a few keystrokes.

Abstract


The Text and Error Profiler works by attuning itself to a particular document, rather than to common traits of printed documents from a certain era, which makes the process highly adaptive. In particular, the tool uses its document-specific knowledge to allow the batch processing of erroneous words. Through statistical analysis of the whole document, the system can identify correct substitute words with high confidence, so that large numbers of systematic errors can be post-corrected with just a few keystrokes.
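The idea of batch-correcting a systematic error pattern can be illustrated with a minimal sketch. This is not PoCoTo's or the Profiler's actual implementation: the function names, the tiny lexicon, and the rule "substitute only when exactly one lexicon word results" are assumptions made for illustration, and the pattern is read here as "the original n was recognised as u".

```python
def candidate_corrections(token, lexicon, pattern):
    """Return lexicon words reachable by undoing one OCR confusion in token.

    pattern is a (correct, ocr) pair, e.g. ("n", "u") for 'n' misread as 'u'.
    Hypothetical helper, not part of any real PoCoTo API.
    """
    correct, ocr = pattern
    found = set()
    start = token.find(ocr)
    while start != -1:
        repaired = token[:start] + correct + token[start + len(ocr):]
        if repaired in lexicon:
            found.add(repaired)
        start = token.find(ocr, start + 1)
    return found


def batch_correct(tokens, lexicon, pattern):
    """Apply one error pattern to every non-lexical token in a single pass."""
    corrected = []
    for token in tokens:
        if token in lexicon:
            corrected.append(token)
            continue
        candidates = candidate_corrections(token, lexicon, pattern)
        # Substitute only when the repair is unambiguous; otherwise keep the token.
        corrected.append(candidates.pop() if len(candidates) == 1 else token)
    return corrected


# Assumed toy data: a five-word lexicon and a short OCR token stream.
lexicon = {"und", "man", "kann", "nun", "den"}
tokens = ["uud", "man", "kaun", "den", "xyz"]
result = batch_correct(tokens, lexicon, ("n", "u"))
print(result)
```

In this sketch a single call repairs every token affected by the pattern, which mirrors the "few keystrokes" claim; a real profiler would rank candidates statistically rather than rely on a fixed lexicon lookup.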

For further information and contact details, please visit the website of the IMPACT working group at the Centrum für Informations- und Sprachverarbeitung, University of Munich.

A complete view of the post-correction graphical interface


Batch-processing of a systematically misspelled word


Batch-processing of all errors based on the systematic error pattern n->u


Batch-processing of a systematically misspelled word

Results


This tool was evaluated, together with the Text and Error Profiler, by the Bavarian State Library (BSB). The aim of the pilot was to measure how much these features can speed up the correction process, by comparing the CIS tool with all its features against a ‘baseline’ version that allowed corrections only on a word-by-word basis, without help from the profiler, comparable to other post-correction solutions available on the market.

Seven employees of the BSB took part in the pilot. They were selected on the basis of availability and expressed interest in the software. To provide a balance of outlooks on the tool, testers were recruited from a range of departments and units to represent all branches working with OCR, from the Department for Manuscripts and Old Prints to the IT department and the Munich Digitization Center. Some of the testers already had experience with other post-correction systems.

In the weeks before the pilot, the BSB IMPACT team selected two items that the volunteers would work on during the pilot phase, from the ‘Demonstration’ subset of the BSB’s Demonstrator Dataset.

The test was conducted in two rounds. In the first round, the tool ran in ‘baseline mode’. In this mode, the functionality of the tool is similar to that of any OCR engine: non-lexical words are marked with wiggly red underlines, and the user has basic correction functions at their disposal, including manual editing, merging, and splitting of words.
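The baseline behaviour of marking non-lexical words can be sketched as a plain lexicon lookup. This is an illustrative assumption, not the tool's code: the function name, the case-insensitive comparison, and the toy lexicon are all hypothetical.

```python
import re


def flag_non_lexical(text, lexicon):
    """Return (offset, word) pairs for words that are not in the lexicon.

    Hypothetical helper: a GUI would underline these spans for manual editing.
    """
    flagged = []
    # [^\W\d_]+ matches runs of letters only, including non-ASCII letters.
    for match in re.finditer(r"[^\W\d_]+", text):
        word = match.group()
        if word.lower() not in lexicon:
            flagged.append((match.start(), word))
    return flagged


# Assumed toy lexicon and OCR line.
lexicon = {"the", "quick", "brown", "fox"}
flags = flag_non_lexical("The qnick brown fox", lexicon)
print(flags)
```

In baseline mode each flagged word would then be corrected one at a time, which is the point of comparison for the batch features.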

In the second round, the tool was used with all features activated. To ensure that users could not remember typical errors or features of a document, which might have influenced their correction speed, they were assigned a new document in the second round.

The results of the experiment are displayed in Figures 1 and 2. For each measuring point (every ten minutes), the number of errors corrected is shown. Empty cells in the table mean that the user forgot to click the save button (User4 had to leave the test early in the first round). ‘User Full’ denotes the users who had all features available, while ‘Users Base’ denotes those who had only the basic features at hand.

Table/Overview statistics:

TEP5

TEP6

In general, users with the full mode available achieved markedly better results than users in the base mode, though the size of the gain varied considerably between individual users.
In total, the full-feature users averaged 6.4 corrections per minute, while the base users averaged 4.0 corrections per minute.
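To put the two averages in relation, the short calculation below derives the relative speed-up and the extra corrections per hour. Only the two rates come from the pilot; the variable names and the extrapolation to a full hour are illustrative assumptions.

```python
# Average corrections per minute reported for the pilot.
full_rate = 6.4
base_rate = 4.0

# Relative speed-up of full mode over base mode (a factor of about 1.6).
speedup = full_rate / base_rate

# Extra corrections per hour, assuming the per-minute rates hold steadily.
extra_per_hour = round((full_rate - base_rate) * 60, 1)

print(f"speed-up: {speedup:.2f}x, extra corrections per hour: {extra_per_hour}")
```

On these figures, full mode yields about 60% more corrections per minute than base mode, keeping in mind that the averages hide considerable variation between users.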

Publications

Availability

Licence pending. For further information, please contact us.
