USAL Line and Word Segmentation
Segmentation is a major function in an OCR system. During this step, the main document components (text / graphic areas, text lines, words and characters or glyphs) are automatically extracted.
Traditionally, segmenting historical machine-printed documents has been tackled by the use of techniques that are mainly designed for contemporary documents.
As a result, several problems inherent in historical documents seriously affect segmentation and, consequently, the recognition accuracy of OCR: the generally low quality of the original volume; complex, dense and irregular layouts; and artefacts not completely corrected during pre-processing (noise between characters, ink diffusion and text skew). Furthermore, volume-specific rules are usually used for segmenting historical machine-printed documents. In a mass digitisation workflow such rules are unworkable, which has necessitated the development of new approaches.
IMPACT introduces novel hierarchical segmentation models that allow the discrete problems of text block, text line, word and character segmentation to be addressed separately while at the same time allowing for interplay between all levels.
Segmentation in the context of document analysis is the partitioning of images into meaningful regions in order to relay them to the next processing step depending on their particular type.
In a typical text recognition scenario, text line segmentation and then word segmentation are the next steps after text regions have been identified in the original. Segmented words are subsequently passed on to character segmentation, which provides the input for the character classifier that is part of any OCR (Optical Character Recognition) software.
For recognition-free document image segmentation (i.e. segmentation based only on geometrical features), there are two major approaches:
- Based on dividing image regions according to gaps (space)
- Based on merging image parts according to connected components / neighbouring objects.
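The first approach in the list above, dividing regions according to gaps, can be illustrated with projection profiles. This is only a minimal sketch under stated assumptions, not the IMPACT toolkit's implementation: the image is assumed to be a binarised, deskewed text region given as rows of 0 (background) and 1 (ink), and the names `ink_runs`, `segment_lines`, `segment_words` and the `min_gap` threshold are hypothetical.

```python
def ink_runs(profile, min_gap=1):
    """Return (start, end) spans (end exclusive) of ink in a projection profile.

    A span is closed only when at least `min_gap` consecutive empty
    positions follow it; shorter gaps are kept inside the span.
    """
    spans = []
    start = None   # start index of the span being built, if any
    gap = 0        # length of the current run of empty positions
    for i, value in enumerate(profile):
        if value > 0:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                spans.append((start, i - gap + 1))
                start = None
                gap = 0
    if start is not None:
        spans.append((start, len(profile) - gap))
    return spans


def segment_lines(image):
    """Split a text region into lines at empty rows (horizontal profile)."""
    row_profile = [sum(row) for row in image]
    return ink_runs(row_profile, min_gap=1)


def segment_words(image, top, bottom, min_gap=2):
    """Split one line (rows top..bottom-1) into words at wide column gaps."""
    width = len(image[0])
    col_profile = [sum(image[y][x] for y in range(top, bottom))
                   for x in range(width)]
    return ink_runs(col_profile, min_gap=min_gap)


# Toy region: '#' is ink, '.' is background (illustrative data only).
rows = [
    "..........",
    "##.#...##.",
    "#..#...#.#",
    "..........",
    ".###..#...",
]
page = [[1 if c == "#" else 0 for c in row] for row in rows]

for top, bottom in segment_lines(page):
    print((top, bottom), segment_words(page, top, bottom, min_gap=2))
```

In this sketch a one-column gap is treated as inter-character spacing and a gap of two or more columns as a word boundary. On real historical material the threshold would need to be estimated per document, and noise or residual skew can close the gaps that the method relies on.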
Both approaches have been implemented and further developed for the initial text line and word segmentation toolkit in order to study and improve the segmentation results at this stage.
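The merging-based approach can be sketched in a similar way. The example below is an illustrative assumption rather than the toolkit's actual method: it labels 8-connected ink components in a binary image (rows of 0 and 1), then repeatedly merges bounding boxes that overlap vertically and lie at most `max_gap` empty columns apart. The function names and the `max_gap` parameter are hypothetical.

```python
from collections import deque


def connected_components(image):
    """Return inclusive boxes (left, top, right, bottom) of 8-connected ink blobs."""
    height, width = len(image), len(image[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if not image[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill from an unvisited ink pixel.
            seen[y][x] = True
            queue = deque([(x, y)])
            left, top, right, bottom = x, y, x, y
            while queue:
                cx, cy = queue.popleft()
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                for ny in range(max(cy - 1, 0), min(cy + 2, height)):
                    for nx in range(max(cx - 1, 0), min(cx + 2, width)):
                        if image[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            boxes.append((left, top, right, bottom))
    return boxes


def are_neighbours(a, b, max_gap):
    """True if two boxes share rows and are at most max_gap empty columns apart."""
    rows_overlap = min(a[3], b[3]) >= max(a[1], b[1])
    gap = max(a[0], b[0]) - min(a[2], b[2]) - 1
    return rows_overlap and gap <= max_gap


def merge_components(boxes, max_gap=1):
    """Greedily merge neighbouring boxes until no pair qualifies."""
    boxes = list(boxes)
    merged = True
    while merged:
        merged = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                if are_neighbours(boxes[i], boxes[j], max_gap):
                    a, b = boxes[i], boxes[j]
                    boxes[i] = (min(a[0], b[0]), min(a[1], b[1]),
                                max(a[2], b[2]), max(a[3], b[3]))
                    del boxes[j]
                    merged = True
                    break
            if merged:
                break
    # Reading order: top to bottom, then left to right.
    return sorted(boxes, key=lambda box: (box[1], box[0]))


# Toy region: '#' is ink, '.' is background (illustrative data only).
rows = [
    "..........",
    "##.#...##.",
    "#..#...#.#",
    "..........",
    ".###..#...",
]
page = [[1 if c == "#" else 0 for c in row] for row in rows]

words = merge_components(connected_components(page), max_gap=1)
print(words)
```

The pairwise merging loop is chosen for readability and is cubic in the number of components in the worst case; a production system would more likely use a spatial index or sort components by position first. Unlike gap-based division, this bottom-up grouping does not require a clean white-space channel across the whole line, although touching characters or ink diffusion can still fuse components that belong to different words.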
- IMPACT deliverable D-TR2: Segmentation and Classification Toolkit (July 2011)
- Antonacopoulos, A. Image Enhancement, Segmentation, Experimental OCR. IMPACT Final Conference 2011, 24-25 October, London, UK
- Jerele, I.