In late 2010, IMPACT and the Biodiversity Heritage Library for Europe (BHL-Europe) got in touch and decided to cooperate in two areas:
- developing sustainable business models for cultural heritage, and
- evaluating how well several IMPACT tools perform on data from the BHL digital collection.
BHL-Europe is the European branch of the Biodiversity Heritage Library, a major effort to bring together existing digital collections of biodiversity literature from libraries all over the world, particularly those associated with natural history museums and botanical gardens.
After an initial meeting at the KB National Library of the Netherlands, participants from both projects quickly agreed to host a joint session on business models for cultural heritage, together with CATCHPlus, alongside the DISH2011 conference. The session, called "After the Brain-Storm: Innovative Ways to Sustain Project Results", discussed business development for cultural heritage with a particular focus on sustainability. More information can be found here. IMPACT also provided a workshop on OCR at the BHL-Europe Content Providers Meeting, held at the Royal Belgian Institute of Natural Sciences in Brussels, Belgium, on 1-2 December 2011.
Evaluation with BHL dataset
In the course of 2011, plans for an evaluation of several IMPACT tools using content from the BHL digital collections became more concrete. A decision was taken to treat BHL-Europe in more or less the same way as the five demonstrator libraries that were partners in the IMPACT project. This meant including all steps, from the selection of a representative dataset, through the creation of the ground truth used for the evaluation, to the development of several workflows comprising chains of IMPACT tools, which would be evaluated in the end.
Accordingly, in a great effort involving a mixed team from both projects, a set of nearly 5,000 images from the BHL repository was selected, and ground truth (up to 100% correct transcriptions of text and layout on a page) was produced for about 2,500 of these using an external service provider. The production of the ground truth corpus proved to be a particularly complex task: the high accuracy level requires that the strictest quality assurance mechanisms are applied both to the layout transcription and to the textual content (in Unicode). Aletheia, a tool developed in IMPACT to create and validate ground truth transcriptions, was provided to BHL-Europe and then used to verify the agreed accuracy in the delivered ground truth files.
Once all the ground truth data had been produced, three workflows were designed for carrying out the evaluation:
- Standard IMPACT processing chain comprising methods from IMPACT partners for border removal, dewarping, binarisation and OCR
- Comparison of new IMPACT FineReader and ground truth
- Comparison of Tesseract with IMPACT FineReader and ground truth
Processing of the BHL data turned out to be quite challenging: the IMPACT tools have been tailored to historical text material, while the BHL-Europe content also comprised plates with illustrations of plants or species and only small captions or regions containing text. The frequent appearance of non-textual objects on these pages turned out to be a serious problem for the image enhancement methods, which often returned corrupted images and sometimes could not provide an output at all, rendering the results useless for the subsequent text recognition step.
Two scanned example page images from the Biodiversity Heritage Library
After this, another workflow was tried, this time leaving out the image enhancement and focussing on the OCR only. The results from this run already looked much better, though some alignment issues between the ground truth and the OCR output prevented a full-scale automatic evaluation. The third test was thus carried out only on a subset of the data, this time comparing the performance of IMPACT FineReader with that of Tesseract (version 3.01). The results indicated that Tesseract is almost on par with FineReader Engine 10, except in those cases where specific dictionaries are used within FineReader.
For further details on the evaluation, please see the corresponding report from BHL-Europe.
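To give a feeling for what comparing OCR output with ground truth can look like in practice, here is a minimal sketch of a character accuracy measure based on edit distance. This is an illustration only: it is not necessarily the metric used in the BHL-Europe evaluation, the function names are our own, and the example strings are made up.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(ground_truth: str, ocr_text: str) -> float:
    """Share of ground truth characters recovered by the OCR,
    defined here as 1 - edit_distance / len(ground_truth), floored at 0."""
    if not ground_truth:
        return 1.0 if not ocr_text else 0.0
    errors = levenshtein(ground_truth, ocr_text)
    return max(0.0, 1.0 - errors / len(ground_truth))


if __name__ == "__main__":
    # Made-up example: one wrong character in a 13-character caption.
    print(character_accuracy("Quercus robur", "Quercus rohur"))
```

A measure like this only makes sense once the OCR text and the ground truth text are aligned region by region, which is the kind of alignment issue mentioned above.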
Finally, the good news for everyone interested in the experiment is that IMPACT and BHL-Europe have agreed to make this data freely available under a CC-BY 3.0 license. This means that you can download the master images as well as the >99.95% correct transcriptions in PAGE format from here. We are already looking forward to the research that will be enabled by the availability of such a dataset!
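For readers who want to get at the transcribed text, the following sketch shows one possible way of pulling the region-level text out of a PAGE XML document with the Python standard library. It rests on a few assumptions: that text sits in `TextRegion` / `TextEquiv` / `Unicode` elements, as we understand the PAGE schema, and that the namespace may differ between schema versions, which is why the code matches on local element names only. The sample document, its namespace URI and the function names are illustrative and not taken from the dataset itself.

```python
import xml.etree.ElementTree as ET
from typing import List

# Illustrative sample only; real files from the dataset will be far larger
# and may use a different schema version (namespace URI).
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2010-03-19">
  <Page imageFilename="example.tif" imageWidth="100" imageHeight="100">
    <TextRegion id="r1">
      <TextEquiv><Unicode>Quercus robur</Unicode></TextEquiv>
    </TextRegion>
    <TextRegion id="r2">
      <TextEquiv><Unicode>Fig. 1</Unicode></TextEquiv>
    </TextRegion>
  </Page>
</PcGts>"""


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def page_region_texts(page_xml: str) -> List[str]:
    """Return the text of every TextRegion, in document order.

    Only TextEquiv elements that are direct children of a TextRegion
    are read, so line-level or word-level text is not counted twice.
    """
    root = ET.fromstring(page_xml)
    texts = []
    for element in root.iter():
        if local_name(element.tag) != "TextRegion":
            continue
        for child in element:
            if local_name(child.tag) != "TextEquiv":
                continue
            for grandchild in child:
                if local_name(grandchild.tag) == "Unicode" and grandchild.text:
                    texts.append(grandchild.text)
    return texts


if __name__ == "__main__":
    for text in page_region_texts(SAMPLE):
        print(text)
```

Matching on local names keeps the sketch independent of the exact schema version; for stricter processing you would validate against the schema the files declare.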
For more information about IMPACT, please contact email@example.com.
For more information about BHL-Europe, please contact firstname.lastname@example.org