Inventory Extraction


Abstract


Traditional OCR engines identify text by using a library of default fonts and typefaces. This means they can be easily confused by irregular characters, by artefacts of poor printing or poor image capture, or by the unexpected appearance of second languages in the text.

The IMPACT Inventory Extraction tool is a prototype with a graphical user interface (GUI) that extracts a complete list of the characters in a document, without reference to a specific language dictionary or a library of fonts. Users can assign properties to textual features within the tool itself, or export the inventory to an OCR engine so that the engine can be trained on particular texts and then perform proper full-text recognition.

Figure 1: Snapshot of the graphical user interface for the Inventory-Extraction-Tool on a Windows XP system
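
To make the idea of a font-independent character inventory more concrete, here is a minimal Python sketch. It is not the tool's actual algorithm: it assumes the characters have already been segmented into equally sized binary bitmaps, and it swaps in a plain pixel-mismatch ratio with greedy clustering for illustration. The names `dissimilarity`, `build_inventory`, and `threshold` are hypothetical.

```python
from typing import List, Sequence

# Assumption: each glyph is a binary bitmap (rows of 0/1) and all glyphs
# share the same dimensions. Segmentation and scaling are out of scope here.
Glyph = Sequence[Sequence[int]]


def dissimilarity(a: Glyph, b: Glyph) -> float:
    """Fraction of pixels that differ between two equally sized bitmaps."""
    total = 0
    differing = 0
    for row_a, row_b in zip(a, b):
        for pixel_a, pixel_b in zip(row_a, row_b):
            total += 1
            differing += pixel_a != pixel_b
    return differing / total if total else 0.0


def build_inventory(glyphs: List[Glyph], threshold: float = 0.2) -> List[List[int]]:
    """Greedy clustering (illustrative only).

    Each glyph joins the first cluster whose representative (its first
    member) is within `threshold`; otherwise it starts a new cluster.
    Returns the clusters as lists of glyph indices.
    """
    clusters: List[List[int]] = []
    for index, glyph in enumerate(glyphs):
        for cluster in clusters:
            if dissimilarity(glyphs[cluster[0]], glyph) <= threshold:
                cluster.append(index)
                break
        else:
            clusters.append([index])
    return clusters


# Toy 3x3 bitmaps standing in for segmented character images.
bar = [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
bar_noisy = [[0, 1, 0], [0, 1, 0], [0, 1, 1]]  # one pixel flipped
ring = [[1, 1, 1], [1, 0, 1], [1, 1, 1]]

inventory = build_inventory([bar, ring, bar_noisy])
print(inventory)  # [[0, 2], [1]]
```

Each resulting cluster corresponds to one entry in the inventory. A user could then label a cluster once (for example, assigning it a character code) rather than labelling every occurrence, which is the kind of step the GUI is described as supporting.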

Publications


IMPACT deliverable D-TR4.1: Inventory Extraction Prototype (February 2011)
Colutto, S. and B. Gatos. “Efficient Word Recognition Using a Pixel-Based Dissimilarity Measure”. ICDAR2011, 18-21 September, Beijing, China.
Colutto, S. “Introducing a New Image Dissimilarity Measure with an Application to Character Image Clustering in Degraded Historical Documents”. DAS2010, 9-11 June, Cambridge, USA.

Availability

The tool is available under an open source licence and can be downloaded from the IMPACT Centre of Competence GitHub page.
