OCR prototype for recognising typewritten documents, incorporating background knowledge about the specific features of this type of document.
Typewritten documents differ from machine-printed documents in that each character is produced independently, with the amount of ink transferred to the paper depending on the force of the original keystroke.
The result is non-uniformity in the intensity of the ink, with some characters being too weakly represented for OCR and others being so strongly impressed that they blur beyond legibility. These problems are exacerbated in carbon copies (of which many exist as primary sources).
Adding to this problem is the administrative or casual nature of the content of most typewritten documents. They tend to contain less natural language than documents printed for publication and consist largely of names, abbreviations and numbers, which makes standard lexicon-aided recognition less useful. To meet these challenges, IMPACT has developed a new approach that integrates background knowledge about this special type of document.
The IMPACT Typewritten OCR prototype works by extracting and computing individually enhanced glyph/character images from original material prior to any classification. This means that the focus is on enhancing individual keystrokes towards recognition, rather than the traditional focus on block, line and word segmentation. Figure 1 shows an overview of the architecture of the Typewritten OCR prototype.
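The idea of enhancing each keystroke on its own, before any classification, can be illustrated with a small sketch. This is not the prototype's actual algorithm: it assumes the glyph bounding boxes are already known and uses a simple min-max contrast stretch as a stand-in for the enhancement step. The names `crop`, `stretch_contrast` and `enhance_glyphs` are hypothetical and chosen only for this illustration.

```python
from typing import List, Tuple

Image = List[List[int]]          # grayscale rows: 0 = black ink, 255 = white paper
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), bottom/right exclusive


def crop(image: Image, box: Box) -> Image:
    """Cut one glyph out of the page image."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


def stretch_contrast(glyph: Image) -> Image:
    """Rescale one glyph so its darkest pixel becomes 0 and its lightest 255.

    Assumption: a plain min-max stretch stands in for the real enhancement.
    """
    pixels = [p for row in glyph for p in row]
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        # A completely flat region carries no contrast to stretch.
        return [row[:] for row in glyph]
    scale = 255 / (hi - lo)
    return [[round((p - lo) * scale) for p in row] for row in glyph]


def enhance_glyphs(image: Image, boxes: List[Box]) -> List[Image]:
    """Enhance every glyph independently of its neighbours."""
    return [stretch_contrast(crop(image, box)) for box in boxes]


# A tiny page with a faint keystroke (left) next to a heavy one (right).
page = [
    [240, 200, 240, 250, 10, 250],
    [240, 200, 240, 250, 10, 250],
]
faint, heavy = enhance_glyphs(page, [(0, 0, 2, 3), (0, 3, 2, 6)])
```

Because each glyph is normalised against its own intensity range rather than that of the whole page, the faint and the heavy keystroke end up with the same contrast, which is the kind of per-character treatment the paragraph above describes.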
- IMPACT deliverable D-TR4.2: Typewritten OCR Prototype (February 2011)
- Pletschacher, S. OCR for Typewritten Documents. IMPACT Final Conference 2011, 24-25 October, London, UK