OCR/NER workshop at the DH Summer School (Leuven, 8 September)

In the course of 2014, colleagues from various divisions of the University Library of KU Leuven (Belgium) were involved in testing tools for OCR (Optical Character Recognition) and NER (Named Entity Recognition), developed in the framework of the Support action Centre for Competence in Digitisation (Succeed). With the invaluable support of colleagues from the Instituut voor Nederlandse Lexicologie (INL), the following tools were tested on a selected corpus of books printed in the 17th, 18th or 19th century, all of them Dutch translations from Latin preserved in the Gulden Librije (a special collection held in Leuven): the Aletheia tool to build the ground truth, ocrevalUAtion, the ABBYY FineReader Engine SDK 11 (with User Pattern Training and with the IMPACT historical lexicon for Dutch, integrated as a FineReader external dictionary), the NE Attestation Tool, the NERAnnotator (Europeana Newspaper NER tool), the Stanford NER tool and NERT. A previous blog post reported on this testing activity (http://www.digitisation.eu/blog/tools-evaluation-university-library-ku-leuven), which received first prize from the members of the Impact Centre of Competence Executive Board in the second edition of the Succeed Awards (http://www.digitisation.eu/blog/2nd-edition-succeed-awards). Thanks to all of this support from INL and Succeed, the University Library of KU Leuven could continue its efforts to develop workflows for OCR and NER, and thus expand its general digitization services. As hoped, the involvement in the Succeed project also helps the University Library to prepare for a larger project concerning OCR and NER of non-mainstream material (such as early modern Latin texts).

To create even more awareness of the possibilities of OCR/NER (and of the services which can be provided by the library), and also as a way to provide (free!) training in the use of these technologies, a full-day workshop will be devoted to OCR and NER in the context of the annual DH summer school, which takes place in Leuven on 7 and 8 September (http://www.arts.kuleuven.be/digitalhumanities/summer_school/index). The summer school consists of a number of activities for all participants (e.g. presentation of DH research in Leuven and poster sessions), a series of specialized workshops for a maximum of 20 participants each, and a panel discussion on ‘Digital Humanities and/in libraries’. The workshops address the Text Encoding Initiative (TEI, by Barbara Bordalejo), Stylometry with R (by Mike Kestemont and Jan Rybicki), Databases and Network Analysis (by Mark Depauw and Yanne Broux) and OCR/NER (by Sam Alloing, Roxanne Wyns, Katrien Depuydt and Jesse de Does). Each workshop takes the form of a more or less theoretical introduction (explaining the technology and its uses, explaining the different available tools, presenting one or more digital humanities projects which have used this technology), followed by hands-on training in one or more tools. The entry level is low: no previous knowledge of the technology or tools is necessary.

In the OCR/NER workshop, participants will learn about the basic principles of OCR and NER for textual content and the different steps in the process, such as building a ground truth, running a comparison test to evaluate the quality and success rate of different OCR outputs, user pattern training, etc. A number of different tools will be introduced, such as the Aletheia tool, the ABBYY FineReader OCR engine, and several Named Entity tools, such as the NE Attestation Tool, NERT and the Stanford NER tool. During the hands-on session, workshop participants (who are encouraged to work on textual material of their own choice) will be assisted in creating a ground truth, evaluating the outcomes of different OCR engines, etc.
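To give an idea of what such a comparison test measures, here is a minimal sketch in Python of the character error rate (CER): the edit distance between a ground-truth transcription and an OCR output, divided by the length of the ground truth. It is only an illustration under our own assumptions, not the implementation used by ocrevalUAtion or any other tool named above; the function names and the sample strings are invented for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions
    and substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def character_error_rate(ground_truth: str, ocr_output: str) -> float:
    """Edit distance relative to the length of the ground truth."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ground_truth, ocr_output) / len(ground_truth)


# Invented sample data: one ground-truth line and two hypothetical OCR outputs.
ground_truth = "de stad Leuven"
ocr_a = "de ftad Leuven"   # long s misread as f
ocr_b = "dc ftad Lcuven"   # additional e/c confusions

for name, text in [("engine A", ocr_a), ("engine B", ocr_b)]:
    print(f"{name}: CER = {character_error_rate(ground_truth, text):.3f}")
```

A lower value means the OCR output is closer to the ground truth, which is why a carefully built ground truth is the first step in any such evaluation.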

Finally, both the summer school as a whole and the individual workshops (for which you can register separately) are quite unusual in that they are completely free to attend for anyone who is interested in learning about the topic. On top of this, the summer school provides a welcome opportunity to visit the medieval university town of Leuven and its libraries, and participants are invited to extend their stay by attending the ‘What do we lose when we lose a library?’ conference, which takes place from 9 to 11 September (http://kuleuvencongres.be/libconf2015/website). So timely enrollment (via http://www.arts.kuleuven.be/digitalhumanities/summer_school/registration-digital-humanities-summer-school) is key.

Demmy Verbeke

University Library of KU Leuven

