Use of digitised and OCRed text collections by end users


Geneviève Cron of the Bibliothèque nationale de France (BnF) begins by discussing the BnF's digital library, Gallica. A million documents have been digitised since 1992, with OCR as standard since 2005. OCR accuracy for newspapers is 98% at word level, but results vary widely, from 60% upwards. For books, the average accuracy is 90%.
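To make the "word level" figures concrete, here is a minimal sketch of one common way to compute word accuracy: compare the OCR output with a corrected ground-truth transcription, count the word-level edits (substitutions, insertions, deletions), and subtract the error rate from 1. This is an assumption for illustration only; the talk does not say how the BnF measures accuracy, and the function names and sample strings are hypothetical.

```python
def word_edit_distance(reference, hypothesis):
    """Levenshtein distance between two lists of words."""
    # prev[j] holds the distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # word missing from OCR output
                            curr[j - 1] + 1,      # extra word in OCR output
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_accuracy(ground_truth, ocr_output):
    """Word accuracy as 1 - (word edits / ground-truth words), floored at 0.

    Assumed definition for illustration, not necessarily the one used by the BnF.
    """
    reference = ground_truth.split()
    hypothesis = ocr_output.split()
    if not reference:
        raise ValueError("ground truth must contain at least one word")
    errors = word_edit_distance(reference, hypothesis)
    return max(0.0, 1.0 - errors / len(reference))


# Hypothetical example: one misread word out of five.
sample_truth = "la bibliothèque nationale de france"
sample_ocr = "la bibliotheque nationale de france"
print(f"{word_accuracy(sample_truth, sample_ocr):.0%}")
```

Under this definition a single misread accent counts as a whole-word error, which is one reason word-level figures tend to be lower than character-level figures for the same page.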


She describes the users of the digital services: they are mostly French or Francophone, and some have special access needs (for example, vision impairment). Queries about the digital store go up every year, and most relate to content rather than bibliographic information. Content queries split into thematic, geographic, history, genealogy and newspapers.

Geneviève goes on to describe the Gallica workflow. A volume is OCRed; some books are sent straight to the store, but newspapers are manually corrected by a service provider, and some other books are manually corrected to reach almost 100% accuracy. ABBYY FineReader is used as a validation tool for the manually corrected text. OCR is useful when the words in user queries are not in the bibliographic data, hence the subject spread of content queries. She outlines the Wikimedia/BnF Collaborative Correction plan, which aims for 100% accuracy through user collaboration. Text-to-speech and EPUB projects are in progress, and groundtruthed datasets are being created within IMPACT to aid further research into improving OCR accuracy.

Niall Anderson, BL + Mark-Oliver Fischer, BSB