Through IMPACT, the state-of-the-art OCR engine ABBYY FineReader has been adapted to cope with the challenge of recognising historical fonts and layouts.
The new SDK, FineReader Engine 10, was released in September 2010. It improves processing speed and recognition accuracy, simplifies development and adds new export formats. A new intelligent binarisation of the document images ensures that more text is passed on to the OCR process. The new technology was developed and tested with historical books provided by the other IMPACT partners.
Improvements in the ABBYY FineReader Engine include:
- A new adaptive binarisation, which works better for documents with non-uniformly coloured backgrounds, noise and bleed-through from the opposite pages.
- Improvements in the segmentation algorithms, for example in picture detection and in the segmentation of historical newspapers.
- Improvements in Fraktur recognition, using ground truth data made available during the IMPACT project.
- Improvements in the External Dictionary interface, which now allows user-developed dictionaries to be used to better recognise languages not supported by ABBYY’s technology.
- Native ALTO export format.
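To illustrate why an adaptive binarisation helps on non-uniformly coloured backgrounds, the following minimal sketch compares a single global threshold with a simple local-mean threshold. It is not ABBYY's algorithm, which is not described here; the function names, the window radius, the offset and the synthetic test page are all assumptions made for the illustration.

```python
def global_threshold(image, threshold=128):
    """Binarise with one fixed threshold: 1 = ink (dark), 0 = background."""
    return [[1 if px < threshold else 0 for px in row] for row in image]


def adaptive_threshold(image, radius=2, offset=10):
    """Binarise each pixel against the mean grey level of its local window.

    A pixel counts as ink (1) when it is more than `offset` grey levels
    darker than the mean of the square window of side 2*radius+1 around it
    (clipped at the image borders). This is a generic local-mean method,
    used here only as a stand-in for "adaptive binarisation".
    """
    height, width = len(image), len(image[0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            y0, y1 = max(0, y - radius), min(height, y + radius + 1)
            x0, x1 = max(0, x - radius), min(width, x + radius + 1)
            window = [image[j][i] for j in range(y0, y1) for i in range(x0, x1)]
            local_mean = sum(window) / len(window)
            row.append(1 if image[y][x] < local_mean - offset else 0)
        result.append(row)
    return result


def make_test_page(width=12, height=5, ink_columns=(2, 9)):
    """Hypothetical test page: the background darkens from left to right,
    and two vertical strokes are 70 grey levels darker than the paper."""
    image, ink_mask = [], []
    for y in range(height):
        row, mask_row = [], []
        for x in range(width):
            background = 220 - 10 * x
            is_ink = x in ink_columns and 1 <= y <= height - 2
            row.append(background - 70 if is_ink else background)
            mask_row.append(1 if is_ink else 0)
        image.append(row)
        ink_mask.append(mask_row)
    return image, ink_mask


if __name__ == "__main__":
    page, truth = make_test_page()
    print("global matches ink mask:  ", global_threshold(page) == truth)
    print("adaptive matches ink mask:", adaptive_threshold(page) == truth)
```

On this synthetic page the fixed threshold misses the stroke on the bright side and marks the darkest background columns as ink, while the local-mean threshold recovers both strokes, because each pixel is judged against its own neighbourhood rather than against the page as a whole.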
Example of an ALTO format export file
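As a rough illustration of how ALTO output can be consumed, the sketch below parses a small hand-written ALTO fragment with Python's standard library and extracts the text of each line together with its mean word confidence. The fragment is not actual FineReader output; it assumes the ALTO version 2 namespace and uses only the common `TextLine`, `String` and `SP` elements with their `CONTENT` and `WC` attributes. The function name `read_text_lines` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: the export uses the ALTO v2 namespace; other ALTO versions
# use a different namespace URI and would need this constant changed.
ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v2#"}

# Hand-written sample for illustration only, not real engine output.
SAMPLE_ALTO = """
<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1" PHYSICAL_IMG_NR="1" WIDTH="2000" HEIGHT="3000">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine HPOS="100" VPOS="200" WIDTH="620" HEIGHT="40">
            <String CONTENT="Historical" HPOS="100" VPOS="200" WIDTH="300" HEIGHT="40" WC="0.9"/>
            <SP HPOS="400" VPOS="200" WIDTH="20"/>
            <String CONTENT="newspaper" HPOS="420" VPOS="200" WIDTH="300" HEIGHT="40" WC="0.8"/>
          </TextLine>
          <TextLine HPOS="100" VPOS="260" WIDTH="220" HEIGHT="40">
            <String CONTENT="Fraktur" HPOS="100" VPOS="260" WIDTH="220" HEIGHT="40" WC="0.7"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def read_text_lines(alto_xml):
    """Return a list of (line_text, mean_word_confidence) tuples.

    The confidence is None for lines whose words carry no WC attribute.
    """
    root = ET.fromstring(alto_xml)
    lines = []
    for text_line in root.iterfind(".//alto:TextLine", ALTO_NS):
        words = text_line.findall("alto:String", ALTO_NS)
        text = " ".join(word.get("CONTENT", "") for word in words)
        confidences = [float(word.get("WC")) for word in words
                       if word.get("WC") is not None]
        mean_wc = sum(confidences) / len(confidences) if confidences else None
        lines.append((text, mean_wc))
    return lines


if __name__ == "__main__":
    for text, confidence in read_text_lines(SAMPLE_ALTO):
        print(text, confidence)
```

Because ALTO stores coordinates and confidence next to every word, the same traversal can be extended to highlight search hits on the page image or to flag low-confidence lines for manual correction.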
See the IMPACT page on the ABBYY historic OCR website: http://frakturschrift.de/en:projects:impact
- IMPACT Report on the comparison of Tesseract and ABBYY FineReader OCR engines (June 2012) by PSNC.
- Fuchs, M. ABBYY and OCR Improvements for IMPACT. IMPACT Final Conference 2011, 24-25 October, London, UK
- Optical Character Recognition (OCR) introduction & overview
- Press release:
- FineReader Engine 10 on ABBYY.com: