Interoperability Framework

Abstract

The technical and research partners in IMPACT have developed more than 20 different tools for various stages in the OCR process. Generally speaking, all of these tools operate on image or text data, either by modifying the data or by extracting information from it. IMPACT has therefore also developed an overall technical framework that allows for a loose coupling of these tools and the exchange of data between them.

With this framework, the tools can be chained into a pipeline in which the output of one tool serves as the input for the next. A further incentive for creating such a framework is that the historical material that libraries, archives and other content holders are digitising in large quantities varies greatly in nature. Newspapers need very good segmentation to prevent columns of text from being merged, which destroys the correct reading order, whereas books that have been scanned at a lower quality may benefit more from the removal of noise from black borders. Because no single combination of tools (called a workflow) is optimal for every purpose, users must be able to try out and evaluate different combinations to find the workflow that works best for their material.
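The pipeline idea described above can be illustrated with a minimal Python sketch. It is not the IMPACT implementation: the two "tools" are hypothetical stand-ins that only manipulate strings, and the names make_workflow, remove_border_noise and normalise_whitespace are assumptions made for this example.

```python
from functools import reduce
from typing import Callable, Sequence

# In this sketch a "tool" is any function that takes page data and
# returns new page data. Real tools operate on images or OCR text.
Tool = Callable[[str], str]


def make_workflow(tools: Sequence[Tool]) -> Tool:
    """Chain tools so that each tool's output is the next tool's input."""
    def run(data: str) -> str:
        return reduce(lambda current, tool: tool(current), tools, data)
    return run


# Hypothetical stand-ins for real tools (assumptions for illustration).
def remove_border_noise(page: str) -> str:
    # Pretend that '#' characters at the edges are border noise.
    return page.strip("#")


def normalise_whitespace(page: str) -> str:
    # Collapse runs of whitespace into single spaces.
    return " ".join(page.split())


book_workflow = make_workflow([remove_border_noise, normalise_whitespace])
print(book_workflow("##  An   old  book page ##"))
```

Because a workflow is just an ordered list of tools, trying a different combination only means building a different list, which is the property that lets users experiment with alternatives.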

Therefore, the IMPACT Framework was developed with four key aspects in mind: modularity (users should be able to create their optimal workflow freely from any combination of tools), transparency (it should be possible to evaluate each individual step in a workflow separately), flexibility (tools and services should be easy to deploy and maintain in heterogeneous environments) and extensibility (it should also be possible to include tools outside of IMPACT).

To fulfil these requirements, each IMPACT tool was wrapped as a web service using a generic Java wrapper. Together with common open source components from the Apache Software Foundation, these web services form the basic layer of the Interoperability Framework.
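The general idea of a generic wrapper, one piece of code that exposes many different command line tools through the same interface, can be sketched as follows. This Python sketch does not reproduce the IMPACT Java wrapper and leaves out the web service layer entirely; the class name CommandLineToolWrapper and the {input} and {output} placeholder convention are assumptions made for this example.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import List


class CommandLineToolWrapper:
    """Expose a command line tool through one uniform method.

    The command template may contain the placeholders {input} and
    {output}; this convention is an assumption of this sketch.
    """

    def __init__(self, name: str, command_template: List[str]):
        self.name = name
        self.command_template = command_template

    def process(self, data: bytes) -> bytes:
        # Write the input to a temporary file, run the tool, read the result.
        with tempfile.TemporaryDirectory() as workdir:
            input_path = Path(workdir) / "input"
            output_path = Path(workdir) / "output"
            input_path.write_bytes(data)
            command = [part.format(input=input_path, output=output_path)
                       for part in self.command_template]
            subprocess.run(command, check=True)
            return output_path.read_bytes()


# A hypothetical "tool": a Python one-liner that upper-cases a file.
uppercase_tool = CommandLineToolWrapper(
    name="uppercase",
    command_template=[
        sys.executable, "-c",
        "import sys, pathlib; pathlib.Path(sys.argv[2]).write_bytes("
        "pathlib.Path(sys.argv[1]).read_bytes().upper())",
        "{input}", "{output}",
    ],
)
print(uppercase_tool.process(b"impact"))
```

Since every wrapped tool offers the same process method, a caller does not need to know which executable runs behind it. A web service layer could then expose that method over the network.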

To easily create complex workflows from single tools, the Taverna workflow system has been added as another layer to the Interoperability Framework. Taverna is an open source and domain-independent suite of tools to design and execute scientific workflows. Each IMPACT tool is represented as a workflow module for Taverna, with documented input and output ports. The output of one tool can then be connected to the input of another by a simple drag-and-drop operation, which allows non-expert users to quickly create their own workflows and obtain results. Together with the ground truth files (100% correct transcription of the text and layout on a scanned page), this makes it possible for users to evaluate the results of different workflows and find out which gives them the best result for their specific material.
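Comparing workflow results against ground truth can be sketched in Python as below. The metric used here, character accuracy based on Levenshtein edit distance, is a common choice for OCR evaluation but is an assumption of this example, not necessarily the measure used in IMPACT. The function names and the sample texts are likewise hypothetical.

```python
from typing import Mapping


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def character_accuracy(ocr_text: str, ground_truth: str) -> float:
    """Share of ground truth characters reproduced correctly (0.0 to 1.0)."""
    if not ground_truth:
        return 1.0 if not ocr_text else 0.0
    errors = edit_distance(ocr_text, ground_truth)
    return max(0.0, 1.0 - errors / len(ground_truth))


def best_workflow(results: Mapping[str, str], ground_truth: str) -> str:
    """Return the name of the workflow whose output is closest to the truth."""
    return max(results,
               key=lambda name: character_accuracy(results[name], ground_truth))


# Hypothetical outputs of two workflows for the same page.
ground_truth = "The quick brown fox"
results = {
    "with_dewarping": "The quick brown fox",
    "without_dewarping": "Tne quick brovvn f0x",
}
print(best_workflow(results, ground_truth))
```

Scoring every workflow against the same ground truth page makes the comparison repeatable, so a user can swap a single tool in or out and see the effect on that one step.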

Evaluating the Dewarping tool (which straightens curved text lines on a page): two different Taverna workflows

Finally, the myExperiment environment, which is integrated with Taverna, is used as the main platform for connecting resources such as tools and workflows with users in cultural heritage institutions throughout Europe. Through this Web 2.0 platform, people can share not only their workflows but also their experiences of applying the tools in their own context and to their own material. The IMPACT Interoperability Framework architecture consists of open source components only.

Publications


Availability

All core components of the Interoperability Framework are freely available under the Apache License, Version 2.0. You can obtain the source code from the IMPACT Centre of Competence GitHub page.