Functional Extension Parser




Scenario


In a mass digitisation workflow: to extract structural features (table of contents, TOC entries, headings, headers, footnotes, etc.) from digitised books. The extraction can be fully automated, provided that METS/ALTO files are available.
As an editing and correction tool for improving already digitised books. Apart from METS/ALTO, enhanced PDFs and eBooks (EPUB) can also be generated.
Since the tool is completely web-based, crowdsourcing applications can also be built on the basis of FEP. The tool would allow users to, for example, post-process a book digitised by Google and receive enhanced PDF or EPUB files.
As a generic document understanding tool where rules can be adapted to all kinds of documents: newspapers, journals, index cards, etc.

Abstract

Introduction

The Functional Extension Parser (FEP) is a Document Understanding Software tool capable of decoding layout elements of books.

Based on the output of Optical Character Recognition, layout elements such as page numbers, running titles, headings, and footnotes are detected and annotated. The FEP has been trained and tested on several datasets comprising books from the 18th to the 20th century from several European libraries. The final release is a production tool ready to be used for the structural annotation of digitised documents.

Why do we need structural information?

Structural tagging of documents offers a large number of benefits. Full-text search can be much more focused, e.g. search results in footnotes can be separated from search results in the running text.

Browsing is made easier for end-users by displaying the correct page number or by linking a table of contents entry to the corresponding page or heading within a book. Using documents conveniently on mobile devices also requires structural tagging, for example to mask running titles or footnotes. Last but not least, structural tagging helps in reprinting a digitised book through the detection of the so-called print space. This structural feature makes it possible to place the elements on a page exactly and to reproduce the “look and feel” of the original in the reprint.

Print space analysis

The print space analysis is best explained by means of an example. Figure 1 displays two images which could serve as input for the FEP system.

The left image belongs to a page with an even page number whereas the right image belongs to a page with an odd page number. The images have some other interesting properties.

The left image is much wider than the right image, which can happen because cropping during scanning is usually done manually. The outer margin of the left image is extremely large compared to the margins of the right image. Another difference between the two images is that the left image belongs to the last page of a chapter and is therefore not completely filled with text. These differences make the images unusable for a reprint of a book and even unsuitable for use in a digital library application.

Figure 1: Input images for the print space analysis


The first step of the print space analysis determines the text which belongs to the print space. This has to be done locally for each image, without information from the other images. To detect the “local” print space correctly, all the rules regarding the definition of the historical print space have to be taken into account (cf. D-EE4.2). In this specific example the line containing the page number belongs to the print space because a page header exists. Figure 2 shows the locally detected print spaces for both images within blue rectangles.
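The detection step can be illustrated with a minimal sketch, assuming the OCR output is available as ALTO XML in which each `TextLine` carries `HPOS`, `VPOS`, `WIDTH` and `HEIGHT` attributes. The function name `local_print_space` and the `Box` type are hypothetical and are not part of FEP. The sketch simply takes the union of all text lines on one page; it does not apply the historical print space rules mentioned above (for instance, deciding whether a page number line belongs to the print space).

```python
import xml.etree.ElementTree as ET
from typing import NamedTuple, Optional


class Box(NamedTuple):
    left: float
    top: float
    right: float
    bottom: float


def local_print_space(alto_xml: str) -> Optional[Box]:
    """Return the union bounding box of all TextLine elements on one page."""
    root = ET.fromstring(alto_xml)
    boxes = []
    for elem in root.iter():
        # Compare the local name only, so any ALTO namespace version matches.
        if elem.tag.rsplit("}", 1)[-1] != "TextLine":
            continue
        attrs = [elem.get(k) for k in ("HPOS", "VPOS", "WIDTH", "HEIGHT")]
        if None in attrs:
            continue  # skip lines without complete coordinates
        x, y, w, h = (float(a) for a in attrs)
        boxes.append(Box(x, y, x + w, y + h))
    if not boxes:
        return None  # empty page: no local print space
    return Box(
        min(b.left for b in boxes),
        min(b.top for b in boxes),
        max(b.right for b in boxes),
        max(b.bottom for b in boxes),
    )


# Hypothetical three-line page used only for illustration.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine HPOS="100" VPOS="80" WIDTH="600" HEIGHT="30"/>
    <TextLine HPOS="120" VPOS="120" WIDTH="580" HEIGHT="30"/>
    <TextLine HPOS="100" VPOS="900" WIDTH="610" HEIGHT="30"/>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

print(local_print_space(SAMPLE))
```

Because each page is processed in isolation, a sparsely filled page yields a smaller box than a full page, which is why a later comparison across pages is needed.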

Figure 2: Print spaces after the detection step


One of the main principles of the historical print space is that the print space has the same size on every page throughout the whole book. Since the locally detected print spaces of the two images do not have the same size, at least one of them must be incorrect. To identify and correct such mistakes, a second step is needed. The challenge in this second step is to compare all locally detected print spaces with each other and to decide which of them fits the original print space best.
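How FEP makes this decision is not described here. One plausible approach, shown purely as an assumption, is a majority vote: group the locally detected sizes into buckets of similar width and height, and take the most frequent bucket as the original print space. The function name `reference_print_space` and the `tolerance` parameter are hypothetical.

```python
from collections import Counter
from statistics import median_high
from typing import List, Tuple

Size = Tuple[int, int]  # (width, height) in pixels


def reference_print_space(sizes: List[Size], tolerance: int = 10) -> Size:
    """Pick the most frequent print space size, bucketed by `tolerance` pixels."""
    if not sizes:
        raise ValueError("need at least one locally detected print space")

    def bucket(size: Size) -> Tuple[int, int]:
        return (round(size[0] / tolerance), round(size[1] / tolerance))

    # Majority vote over buckets of similar sizes (an assumed heuristic).
    winner, _ = Counter(bucket(s) for s in sizes).most_common(1)[0]
    members = [s for s in sizes if bucket(s) == winner]
    # Use a median inside the winning bucket to damp small OCR jitter.
    return (
        median_high(w for w, _ in members),
        median_high(h for _, h in members),
    )


# Four full pages and one chapter-ending page with less text.
detected = [(600, 850), (602, 851), (600, 520), (598, 849), (601, 850)]
print(reference_print_space(detected))
```

A vote of this kind only works when most pages are completely filled with text; books with many short pages would need a more careful rule.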

Once the original print space has been decided on, all print spaces that differ from it need to be corrected. In our example the print space of the right image was considered to be the original print space of the book. Comparing the locally detected print space of the left image with the original print space shows that the left print space has the same width as the original one but has to be enlarged towards the bottom. Identifying the direction in which a print space has to be changed in order to fit the original print space is one of the most challenging tasks of the print space analysis.

Because the margins within the images do not always have the same size or, even worse, the images themselves do not have the same size, detecting the direction in which a print space has to be changed is error-prone and the result may not always match reality. Figure 3 displays the result after the successful execution of the reconstruction step.
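The reconstruction step can be sketched as follows. The direction heuristic used here is an assumption and not the documented FEP rule: the box is grown toward the side of the image that has more free space. As noted above, any heuristic based on margins can fail when cropping is uneven. The names `Box` and `reconstruct` are hypothetical.

```python
from typing import NamedTuple


class Box(NamedTuple):
    left: int
    top: int
    right: int
    bottom: int


def reconstruct(local: Box, ref_width: int, ref_height: int,
                image_width: int, image_height: int) -> Box:
    """Resize `local` to the reference size, growing toward the emptier side."""
    missing_w = ref_width - (local.right - local.left)
    missing_h = ref_height - (local.bottom - local.top)
    left, top, right, bottom = local

    # Vertical direction: compare free space below and above the text.
    if image_height - local.bottom >= local.top:
        bottom += missing_h  # e.g. last page of a chapter
    else:
        top -= missing_h     # e.g. first page of a chapter

    # Horizontal direction: same idea for right versus left.
    if image_width - local.right >= local.left:
        right += missing_w
    else:
        left -= missing_w

    return Box(left, top, right, bottom)


# A chapter-ending page: correct width, but text stops early.
short_page = Box(150, 100, 750, 620)
print(reconstruct(short_page, 600, 850, image_width=1000, image_height=1200))
```

In this example the width already matches the reference, so only the bottom edge moves, which mirrors the left image in the running example.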

Figure 3: Print spaces after the reconstruction step


Both print spaces now have exactly the same size, which means that the images have been normalized and are ready for additional refinements for their different use cases.

During the last step, called the refinement step, margins are added to the reconstructed print spaces. The size of the margins to be added depends strongly on the area of application. A digital library application, for example, needs images with the same margins on the left and on the right side, since the images are displayed on a screen. When a book is reprinted, the original margins of the book have to be reconstructed. The margins within a book were originally set on the basis of the historical print space. Figure 4 shows the images with the margins reconstructed using the print space construction rules of Jan Tschichold, a well-known German typographer. The images of the book are now ready to be delivered to a reprint application.
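The two margin policies can be sketched as follows. For the reprint case, the sketch assumes the division into ninths commonly attributed to Tschichold's page canon, in which the print space covers six ninths of the page in each dimension, the inner and top margins are one ninth and the outer and bottom margins are two ninths. Whether FEP applies exactly this variant is not stated here, and the names `Margins`, `reprint_margins` and `screen_margins` are hypothetical.

```python
from typing import NamedTuple


class Margins(NamedTuple):
    inner: float
    top: float
    outer: float
    bottom: float


def reprint_margins(print_width: float, print_height: float) -> Margins:
    """Margins for a reprint, assuming the ninths division of the page."""
    # The print space is 6/9 of the page, so the page is 3/2 of the print space.
    page_width = print_width * 3 / 2
    page_height = print_height * 3 / 2
    return Margins(
        inner=page_width / 9,
        top=page_height / 9,
        outer=2 * page_width / 9,
        bottom=2 * page_height / 9,
    )


def screen_margins(print_width: float, canvas_width: float, vertical: float) -> Margins:
    """Margins for on-screen display: equal left and right margins."""
    side = (canvas_width - print_width) / 2
    return Margins(inner=side, top=vertical, outer=side, bottom=vertical)


print(reprint_margins(600, 900))
print(screen_margins(600, 800, vertical=50))
```

For a print space with proportions 2:3, the reprint margins come out in the ratio 2:3:4:6 (inner, top, outer, bottom), which is the relationship usually cited for this canon.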

Figure 4: Output of the FEP regarding print space analysis


Publications

Availability

The tool is licensed by the University of Innsbruck. For further information on licensing, please contact the UIBK IMPACT group.
