In early 2013, the Library Council of the Humanities and Social Sciences at KU Leuven (Belgium) produced a white paper concerning the role of libraries in digital scholarship (and, particularly, the digital humanities). This white paper described the current services provided by libraries at KU Leuven in the field of digital scholarship, but also discussed the opportunities and challenges ahead and identified a number of key areas on which the libraries concerned wanted to focus in the near future.
The list included:
- Initiating and supporting digitisation projects.
- Supporting relevant grant applications (for instance by taking over the writing of the more technical or financial sections in a grant application and helping the researchers to translate a research idea into a feasible project).
- Acting as a valued partner in digital humanities projects, from inception to completion (and beyond): contributing to their writing and planning, taking part in the actual project (e.g. by supplying digitisation services), and helping to preserve and disseminate the research results once the funding has run out.
- Providing training in tools for digital humanities research.
- Playing an expert role in the field of scholarly communication.
So far, most of our attention has been focused on digitisation projects, partly because the University Library already has a considerable track record in this field thanks to its Digital Lab, a high-tech digital photography centre which is well known for its work with the Portable Light Dome. The decision was taken to strengthen and expand these digitisation efforts by including the digitisation of textual material with a view to building digital corpora of texts (rather than digitising books as representations of the physical object). In order to do this, we needed to gain more competence in the fields of OCR (Optical Character Recognition) and NER (Named Entity Recognition), especially when applied to non-mainstream materials (such as early printed books or manuscripts), so that we could integrate these technologies, where applicable, in the digitisation workflow at KU Leuven.
The Support action Centre for Competence in Digitisation (Succeed), which promotes the take-up and validation of research results in mass digitisation with a focus on textual content, provided us with an excellent opportunity to achieve this goal. It was the ideal context for us to get involved as a library, since Succeed not only supports the validation of digitisation and linguistic tools and resources created by research and development programmes, but also supports their transfer to libraries and other cultural heritage organisations for practical use. Succeed thus offered us the chance to test relevant digitisation tools and to make our modest contribution to their development and validation, whilst, at the same time, allowing us to gain some much-needed competence in these fields.
With the invaluable support of our colleagues from the Instituut voor Nederlandse Lexicologie (INL), we tested the following tools:
- The aptly named Aletheia tool to build the ground truth.
- The ocrevalUAtion tool, which allows for the comparison of a reference text with the OCR output, but also for the comparison of the output of two different OCR engines or two different settings of one OCR engine.
- The ABBYY FineReader engine SDK 11 to execute the OCR, with User Pattern Training and with the IMPACT historical lexicon for Dutch, integrated as a FineReader external dictionary.
- A number of NER tools: NE Attestation Tool, NERAnnotator (Europeana Newspaper NER tool), Stanford NER tool and NERT.
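To make the idea of comparing a reference text with OCR output more concrete, here is a minimal Python sketch of one common measure, the character error rate (the edit distance between the two texts divided by the length of the reference). It is an illustration only and not the implementation used by ocrevalUAtion; the function names `levenshtein` and `character_error_rate` and the sample strings are our own assumptions.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of character insertions, deletions and
    substitutions needed to turn `reference` into `hypothesis`."""
    # previous[j] holds the distance between the reference prefix
    # processed so far and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # drop a reference character
                current[j - 1] + 1,      # add a hypothesis character
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, ocr_output: str) -> float:
    """Edit distance divided by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, ocr_output) / len(reference)


# Hypothetical example: an OCR engine misreads each long s as an f.
ground_truth = "geschiedenissen"
engine_output = "gefchiedeniffen"
print(character_error_rate(ground_truth, engine_output))
```

The same `levenshtein` function can be applied to the output of two OCR engines, or of two settings of one engine, to see how far they diverge from each other rather than from the ground truth.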
To test these tools, we selected a small corpus of books which, in our view, would offer an excellent test case to evaluate OCR and NER tools for the digitisation of non-mainstream materials. The selected corpus consisted of 13 books printed in the 17th, 18th or 19th century which had never been digitised before. All of them are Dutch translations from Latin preserved in the Gulden Librije (a special collection of the Arts Faculty Library), namely:
- From the 17th century: Tacitus, Ghedenkwaerdige geschiedenissen der Romeinen (1645); Virgil, Aeneis (1662); Ovid, Treur-gesangen (1692)
- From the 18th century: Boethius, Vertroostinge der wysgeerte (1703); Seneca, Christelycke Seneca (1705); Nepos, Leeven der doorluchtige veld-ooversten (1726) & Leevens van doorlugtige mannen (1796); Horace, Hekeldichten en brieven (1728); Virgil, Wercken (1737); Augustine, Belydenis (1741)
- From the 19th century: Ovid, Treur-digten (1814-5); Horace, Over de dichtkunst (1866); Augustine, Stad Gods (1876-8)
The major outcome of the project is that it allowed us to develop workflows for OCR and NER, which will prove essential for the further integration of these technologies into our general digitisation workflow. As hoped, our involvement in the Succeed project has also permitted us to strengthen and expand our digitisation services and has helped us to prepare for a new and larger project concerning OCR and NER for early modern Latin texts, which is planned to start in 2015. Finally, the Succeed project was an excellent opportunity to join an international network of competence centres in digitisation, and to share and exchange experiences, best practices, and test results. Needless to say, we hope to remain an active partner in this network and to continue to make a contribution to the field of mass digitisation.
Sam Alloing, University Library of KU Leuven & Demmy Verbeke, University Library of KU Leuven