The Functional Extension Parser – a rule-based system for flexible structural analysis

Lukas Gander of the Universitäts- und Landesbibliothek Tirol (University and Regional Library Tyrol) outlines the concept behind the Functional Extension Parser (FEP): using an OCR engine’s output to create a structural map of a page or volume. OCR engines capture much more than plain text: their output also records, for instance, the type and position of each text element. The FEP will spot if, say, numerical values appear repeatedly at the bottom of pages and tag them as page numbers. It handles tables of contents, chapter headings, indices and formulae in a similar way. The FEP does this by applying rules designed to model a human’s intuitive understanding of book structures.
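
To make the page-number example concrete, here is a minimal sketch of what one such rule might look like. It is not the FEP's actual implementation: the `Block` structure, its field names, the `tag_page_numbers` function and the thresholds `bottom` and `min_pages` are all assumptions made for illustration, and real OCR output would carry far richer layout information.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Block:
    """One OCR text block. Hypothetical structure, not a real OCR schema."""
    page: int                    # zero-based page index within the volume
    text: str                    # recognised text of the block
    y: float                     # vertical position: 0.0 = top, 1.0 = bottom
    label: Optional[str] = None  # structural tag assigned by a rule


def tag_page_numbers(blocks: List[Block],
                     bottom: float = 0.9,
                     min_pages: int = 3) -> List[Block]:
    """Tag numeric blocks near the page bottom as page numbers.

    The tag is applied only if such blocks recur on at least `min_pages`
    distinct pages, mirroring the idea that repetition across a volume
    is what makes a number a page number. Both thresholds are assumed.
    """
    candidates = [b for b in blocks
                  if b.text.strip().isdigit() and b.y >= bottom]
    pages_with_candidate = {b.page for b in candidates}
    if len(pages_with_candidate) >= min_pages:
        for b in candidates:
            b.label = "page_number"
    return blocks


# Usage: three pages with a number at the bottom, plus a year in running text.
blocks = [
    Block(0, "Chapter 1", 0.10),
    Block(0, "12", 0.95),
    Block(1, "1848", 0.40),   # numeric, but not at the bottom of the page
    Block(1, "13", 0.95),
    Block(2, "14", 0.96),
]
tag_page_numbers(blocks)
print([b.text for b in blocks if b.label == "page_number"])
# prints ['12', '13', '14']
```

The sketch looks only at Arabic numerals and a fixed position threshold; a fuller rule set would presumably also consider Roman numerals, numbers placed in page headers, and whether the values increase from page to page.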