Traditional OCR engines recognise text by matching it against a built-in library of fonts and typefaces. As a result, they are easily confused by irregular characters, by artefacts of poor printing or poor image capture, or by the unexpected appearance of a second language in the text.
The IMPACT Inventory Extraction tool is a prototype with a graphical user interface (GUI) that extracts a complete list of the characters in a document without reference to a language-specific dictionary or a library of fonts. Users can assign properties to textual features within the tool itself, or export the inventory to an OCR engine so that the engine can be trained on a particular text and then carry out full-text recognition.
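To make the idea of a font-free character inventory concrete, here is a minimal sketch in Python. It is not the IMPACT prototype's algorithm: it assumes the glyphs have already been segmented into equally sized binary images, it uses a plain fraction-of-differing-pixels score as a stand-in dissimilarity measure, and the names `dissimilarity`, `build_inventory` and the `threshold` value are hypothetical choices made for illustration.

```python
from typing import List, Sequence

# A glyph is a binary image: rows of 0/1 pixels, all glyphs the same size (assumption).
Glyph = Sequence[Sequence[int]]


def dissimilarity(a: Glyph, b: Glyph) -> float:
    """Fraction of pixels that differ between two same-sized binary glyphs."""
    total = 0
    differing = 0
    for row_a, row_b in zip(a, b):
        for pixel_a, pixel_b in zip(row_a, row_b):
            total += 1
            differing += pixel_a != pixel_b
    return differing / total


def build_inventory(glyphs: List[Glyph], threshold: float = 0.2) -> List[List[int]]:
    """Greedy clustering into an inventory of character classes.

    Each glyph joins the first cluster whose representative (its first
    member) is within `threshold`; otherwise it starts a new cluster.
    Returns the clusters as lists of glyph indices.
    """
    clusters: List[List[int]] = []
    for index, glyph in enumerate(glyphs):
        for cluster in clusters:
            if dissimilarity(glyphs[cluster[0]], glyph) <= threshold:
                cluster.append(index)
                break
        else:
            clusters.append([index])
    return clusters


# Three toy 3x3 glyphs: a vertical bar, a ring, and the bar with one noisy pixel.
bar = [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
ring = [[1, 1, 1], [1, 0, 1], [1, 1, 1]]
bar_noisy = [[0, 1, 0], [0, 1, 0], [0, 1, 1]]

print(build_inventory([bar, ring, bar_noisy]))  # [[0, 2], [1]]
```

Each resulting cluster stands for one entry in the inventory. No dictionary or font library is consulted at any point; a user or an OCR training step would attach a character label to each cluster afterwards.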
IMPACT deliverable D-TR4.1: Inventory Extraction Prototype (February 2011)
Colutto, S. and B. Gatos. “Efficient Word Recognition Using A Pixel-Based Dissimilarity Measure”. ICDAR2011, 18-21 September, Beijing, China.
Colutto, S. “Introducing a New Image Dissimilarity Measure with an Application to Character Image Clustering in Degraded Historical Documents”. DAS2010, 9-11 June, Cambridge, USA.