NCSR Character segmentation
Segmentation is a major function in an OCR system. During this step, the main document components (text / graphic areas, text lines, words and characters or glyphs) are automatically extracted.
Traditionally, the segmentation of historical machine-printed documents has been tackled with techniques designed mainly for contemporary documents.
As a result, several problems inherent in historical documents seriously affect segmentation and, consequently, the recognition accuracy of OCR: the generally low quality of the original volume; complex, dense and irregular layouts; and artefacts not completely corrected during pre-processing (noise between characters, ink diffusion and text skew). Furthermore, the segmentation of historical machine-printed documents usually relies on volume-specific rules. In a mass digitisation workflow this is unworkable, which has necessitated the development of new approaches.
IMPACT introduces novel hierarchical segmentation models that allow the discrete problems of text block, text line, word and character segmentation to be addressed separately while at the same time allowing for interplay between all levels.
To test the IMPACT toolkits at each segmentation level, we assume as input the correct result from the previous level.
For the case of character segmentation, we present the output when starting from the correct word segmentation result. The algorithm is based on finding all possible segmentation paths by linking the feature points on the skeleton of the word and its background.
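The idea of a segmentation path that is not a straight vertical cut can be illustrated with a small sketch. This is not the NCSR algorithm: it computes no skeletons and links no feature points. As a simplified stand-in, it runs a breadth-first search through background pixels to find a path from the top row to the bottom row of a binary word image. The function names, the image encoding (1 for ink, 0 for background) and the sample `WORD` image are all assumptions made for illustration.

```python
from collections import deque

def background_cut_path(image, start_col):
    """Return a shortest path of background pixels (value 0) from the top
    row at start_col to any pixel in the bottom row, or None if none exists.
    Moves are restricted to down, left and right (no upward moves)."""
    rows, cols = len(image), len(image[0])
    if image[0][start_col] != 0:
        return None
    parent = {(0, start_col): None}
    queue = deque([(0, start_col)])
    while queue:
        r, c = queue.popleft()
        if r == rows - 1:
            # Reached the bottom row: walk back through parents to build the path.
            path = []
            node = (r, c)
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for dr, dc in ((1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and image[nr][nc] == 0 and (nr, nc) not in parent):
                parent[(nr, nc)] = (r, c)
                queue.append((nr, nc))
    return None

def candidate_cut_paths(image):
    """Map each top-row background column to its cut path, if one exists."""
    paths = {}
    for col in range(len(image[0])):
        path = background_cut_path(image, col)
        if path is not None:
            paths[col] = path
    return paths

# Hypothetical word image: two glyphs separated by a slanted gap.
# No single column is entirely background, so a straight vertical cut fails.
WORD = [
    [1, 1, 0, 1, 1, 1],
    [1, 1, 0, 0, 1, 1],
    [1, 1, 1, 0, 1, 1],
    [1, 1, 1, 0, 1, 1],
]

paths = candidate_cut_paths(WORD)
```

Because this stand-in only follows background pixels, it cannot separate characters that touch. Handling those cases is where feature points on the skeleton of the word itself, as described above, would come in.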
- IMPACT deliverable D-TR2: Segmentation and Classification Toolkit (July 2011)
- Gatos, B., A. Kesidis and A. Papandreou. “Adaptive Zoning Features for Character and Word Recognition”. ICDAR2011, 18-21 September, Beijing, China.
- Gatos, B., G. Louloudis and N. Stamatopoulos. “Greek Polytonic OCR Based on Efficient Character Class Number Reduction”. ICDAR2011, 18-21 September, Beijing, China.
- Vamvakas, G., N. Stamatopoulos, B. Gatos and S.J. Perantonis. “Automatic Unsupervised Parameter Selection for Character Segmentation”. DAS2010 Conference (9-11 June, Cambridge, USA)
- Gatos, B. IMPACT Tools Developed by NCSR. IMPACT Final Conference 2011, 24-25 October, London, UK
- Jerele, I.