CIS-LMU Text and Error Profiler
The Text and Error Profiler is software to analyse the OCR output from historical documents, using statistical modelling of document characteristics to improve OCR accuracy. It works by attuning itself to a particular document, rather than to common traits of printed documents from a certain era, resulting in a highly adaptive process. The tool uses its document-specific knowledge to allow the batch processing of erroneous words.
The software was developed during the IMPACT Project (2008-2011) by a working group at the University of Munich.
The statistical models are as follows:
1. An analysis of the vocabulary of the document, focussing on potential variant spellings, and on the appearance of secondary or tertiary languages in the text.
2. An analysis of known language rules or patterns that can explain variant spellings.
3. An analysis of the OCR error rate, focussing on systematic errors (i.e., those that appear to be introduced by the process of OCR itself).
4. Related to this, an analysis of which error patterns occur with high probability (e.g., ‘i’ for ‘l’, ‘in’ for ‘m’, ‘n’ for ‘u’).
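The fourth analysis can be illustrated with a small sketch that tallies character-level substitutions between OCR tokens and their correct forms. This is not the profiler's own method: it assumes that a correct form is already known for each token (a simplification), and the function name `error_patterns` and the sample word pairs are hypothetical. It uses Python's standard `difflib` to align each pair.

```python
from collections import Counter
from difflib import SequenceMatcher


def error_patterns(pairs):
    """Count (intended, recognised) substitutions over (ocr, truth) pairs.

    Assumption: each OCR token comes with a known correct form.
    """
    counts = Counter()
    for ocr, truth in pairs:
        matcher = SequenceMatcher(None, truth, ocr, autojunk=False)
        for tag, t1, t2, o1, o2 in matcher.get_opcodes():
            # Only substitutions are tallied; insertions and deletions are ignored.
            if tag == "replace":
                counts[(truth[t1:t2], ocr[o1:o2])] += 1
    return counts


# Hypothetical sample pairs: (OCR output, correct form).
sample = [("tiine", "time"), ("iong", "long"), ("oid", "old"), ("honse", "house")]
patterns = error_patterns(sample)
for (intended, recognised), n in patterns.most_common():
    print(f"'{recognised}' for '{intended}': {n}")
```

Patterns that recur across many tokens are candidates for systematic OCR errors, whereas patterns seen only once are more likely to be incidental.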
This analysis can be used for quality control, post-correction, retrieval, and a second run of the OCR that will adapt to the results produced by the Text and Error Profiler. The uses of the tool in post-correction are described on the Post Correction Tool page.
For further information and contact details, please visit the website of the IMPACT working group at the Centrum für Informations- und Sprachverarbeitung, University of Munich.
The tool was evaluated together with the Post Correction Tool by the Bavarian State Library (BSB). The aim of the pilot was to measure how much the profiler-supported features can speed up the correction process, by comparing the CIS tool with all its features against a ‘baseline’ version that allowed corrections only on a word-by-word basis, without help from the profiler, comparable to other post-correction solutions available on the market.
Seven employees of the BSB took part in the pilot. They were identified on the basis of availability and expressed interest in the software. In order to provide a balance of outlooks on the tool, testers were recruited from a range of departments and units, to represent all branches working with OCR, from the Department for Manuscripts and Old Prints to the IT department and the Munich Digitization Center. Some of the testers had already had experience with other post-correction systems.
In the weeks before the pilot, the BSB IMPACT team selected two items that the volunteers would work on during the pilot phase, from the ‘Demonstration’ subset of the BSB’s Demonstrator Dataset.
The test was conducted in two rounds. In the first round, the tool ran in ‘baseline mode’. In this mode, the functionality of the tool is similar to that of any OCR engine: non-lexical words are marked with wavy red underlines, and the user has basic correction functions at their disposal, which include manual editing of words as well as merging and splitting of words.
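The marking of non-lexical words in baseline mode can be pictured as a plain lexicon lookup. The following is a minimal sketch under stated assumptions, not the tool's implementation: the function name `flag_non_lexical`, the tokenisation rule, and the tiny lexicon are all hypothetical.

```python
import re


def flag_non_lexical(text, lexicon):
    """Return the tokens of `text` that are not in `lexicon` (case-insensitive)."""
    # Treat any run of letters as a token; digits and punctuation are skipped.
    tokens = re.findall(r"[^\W\d_]+", text)
    return [token for token in tokens if token.lower() not in lexicon]


# Hypothetical lexicon and OCR output.
lexicon = {"the", "old", "house"}
flagged = flag_non_lexical("The oid honse", lexicon)
print(flagged)
```

A lookup of this kind cannot tell a genuine OCR error from a valid historical spelling that is missing from the lexicon, and each flagged word still has to be corrected individually.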
In the second round, the tool was used with all features activated. To ensure that users could not rely on remembering typical errors or features of a document, which might have influenced their correction speed, they were assigned a new document in the second round.
The results of the experiment are displayed in Figures 1 and 2. For each measurement point (every ten minutes), the number of errors corrected is shown. Empty cells in the table mean that the user forgot to click the save button (User4 had to leave the test early in phase 1). ‘User Full’ are the users who had all the features available, while ‘Users Base’ are the ones who had only the basic features at hand.
In general, a user with the full mode available can achieve much better results than a user in the base mode, though the extent varies considerably depending on the individual user.
In total, the full-feature users average 6.4 corrections a minute, while the base users average 4.0 corrections a minute.
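To put the two averages in proportion, the short calculation below derives the relative speed-up and the time each mode would need for a fixed workload. The workload of 1,000 errors is a hypothetical figure chosen for illustration, and the calculation assumes the average rates hold constant, which the variation between individual users suggests is only a rough approximation.

```python
FULL_RATE = 6.4  # average corrections per minute, full-feature users (from the pilot)
BASE_RATE = 4.0  # average corrections per minute, baseline users (from the pilot)


def minutes_needed(errors, corrections_per_minute):
    """Time in minutes to correct `errors` errors at a constant rate."""
    return errors / corrections_per_minute


speedup = FULL_RATE / BASE_RATE
ERRORS = 1000  # hypothetical workload, not a figure from the pilot

print(f"speed-up factor: {speedup:.2f}")
print(f"full mode: {minutes_needed(ERRORS, FULL_RATE):.2f} minutes")
print(f"base mode: {minutes_needed(ERRORS, BASE_RATE):.2f} minutes")
```

On these averages, the full-feature mode is 1.6 times as fast as the baseline, that is, 60% more corrections per minute.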
- IMPACT Pilot report on Postcorrection Tools (June 2012) by KB National Library of the Netherlands
- Reffle, U. Analysis and Post-Correction of OCR-Processed Historical Documents. IMPACT Final Conference 2011, 24-25 October, London, UK.