Impact will be present on 28 October at the CERL Annual Seminar 2014, ‘The Application of Text Encoding Facilities to Digital Versions of European Early Books’, which will take place at the National Library of Norway in Oslo. At the conference, Tomasz Parkoła (Digital Libraries Team, Supercomputing and Networking Center, Poznań) will talk about ‘The Impact Centre of Competence: tools for text digitisation and transcription’.
The DFG in Germany has announced a research programme to enhance digitisation and OCR-related technology. It covers topics similar to those addressed in Impact.
The second Digitisation Day started with two parallel sessions: the scientific presentations on “Best practices and experiences in digitisation of cultural heritage”, and the round table on “Future research for the digital library”.
2nd Succeed hackathon at the University of Alicante
Is there anyone out there still thinking that a hackathon is a malicious break-in?
Far from it. It is the best way for developers and researchers to get together and work on new tools and innovations. The 2nd developers workshop / hackathon organised on 10-11 April by the Succeed Project was a case in point: bringing people together to work on new ideas and find fresh inspiration for better OCR. The event was held in the Claude Shannon room of the Department of Software and Computing Systems (DLSI) of the University of Alicante, Spain. Claude Shannon was a famous mathematician and engineer and is also known as the “father of information theory”. So it seemed a good place to have a hackathon!
On the 10th and 11th of April 2014 at the University of Alicante, the Succeed project held a hackathon whose aim was to look at improving the state-of-the-art open-source tools for the digitisation of textual content such as books and newspapers.
What would be the estimated cost of a project pursuing 100% post-correction of the OCRed full text offered to users (based on an average book of 200-300 pages)? An example is Project Gutenberg (http://www.gutenberg.org/).
Does your institution have experience with this service?
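As a starting point for answering that question, the cost of full post-correction can be put on a back-of-envelope formula: pages × correction time per page × labour rate. The sketch below is purely illustrative; the default values for `minutes_per_page` and `hourly_rate` are invented assumptions, not figures from any real project.

```python
# Hypothetical back-of-envelope cost model for 100% manual post-correction
# of OCR output. All default parameters are illustrative assumptions.

def correction_cost(pages, minutes_per_page=4.0, hourly_rate=15.0):
    """Estimate the labour cost of fully correcting one book.

    pages            -- number of pages in the book
    minutes_per_page -- assumed average correction time per page
    hourly_rate      -- assumed labour cost per hour (any currency)
    """
    hours = pages * minutes_per_page / 60.0
    return hours * hourly_rate

# An "average" 250-page book under these assumed parameters:
print(round(correction_cost(250), 2))  # 250 * 4 / 60 hours * 15 = 250.0
```

In practice the per-page time varies enormously with OCR quality, script and layout, which is exactly why institutions' experience reports are so valuable here.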
This practical session started with the attendees introducing themselves and splitting up into 3 groups, so that each could work on a different set of tasks based on a Case Study.
Jesse De Does from the INL gave a brief but rich presentation on the evaluation of lexicon-supported OCR and the project’s recent improvements. To evaluate lexica in OCR, the FineReader SDK 10 is used. In short, the software measures OCR with a default built-in dictionary and, for each word or fuzzy set, it gives a number of alternatives and segmentations; it is then up to the user to manually select the most suitable or probable option. Lexica, however, may include errors, and the fuzzy sets created by FineReader may be too small (we will never have all spelling variants or compounds). Thus a number of actions, including word recall evaluation, dictionary cleaning and the incorporation of historical dictionaries, are taken in order to increase performance, even if only by small percentages.
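The core idea of lexicon-supported correction can be sketched in a few lines: for each OCR token, propose lexicon entries that are sufficiently similar as correction candidates. The toy lexicon, similarity measure and threshold below are invented for illustration; a real pipeline such as the FineReader-based evaluation described above works at a far larger scale and also scores segmentation alternatives.

```python
# Minimal sketch of lexicon-supported OCR correction: for each OCR token,
# propose dictionary words above a similarity threshold, best match first.
# Lexicon, threshold and similarity measure are illustrative assumptions.

from difflib import SequenceMatcher

LEXICON = {"historical", "dictionaries", "performance", "recall"}

def candidates(token, lexicon=LEXICON, min_ratio=0.8):
    """Return lexicon words similar to the OCR token, best match first."""
    scored = [(SequenceMatcher(None, token, word).ratio(), word)
              for word in lexicon]
    return [word for ratio, word in sorted(scored, reverse=True)
            if ratio >= min_ratio]

# A typical OCR confusion ('i' misread as 'l'):
print(candidates("historlcal"))  # ['historical']
```

Word recall in this setting is simply the fraction of ground-truth words for which the correct form appears among the proposed candidates, which is why cleaning the lexicon and adding historical dictionaries improves it.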
Asaf Tzadok (IBM Haifa Research Lab) showed us IBM’s CONCERT tool, which facilitates collaborative OCR correction. CONCERT (Cooperative Engine for the Correction of Extracted Text) works in three steps: a character session, a word session and a page-level session.
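The character-session step can be illustrated with a small sketch: suspicious characters are grouped by the class the OCR engine assigned, so a reviewer can confirm or reject many instances of the same glyph at once. The data structures, confidence values and threshold below are invented for illustration and do not reflect CONCERT's actual internals.

```python
# Illustrative sketch of a "character session": group low-confidence
# characters by recognised class for batch review. All data and the
# confidence threshold are hypothetical.

from collections import defaultdict

# (recognised_char, confidence, location) tuples from a hypothetical OCR run
chars = [
    ("e", 0.55, "p1:w3:c2"),
    ("e", 0.60, "p2:w7:c1"),
    ("c", 0.58, "p1:w9:c4"),
    ("a", 0.97, "p3:w2:c5"),  # confident; not queued for review
]

def character_session(chars, threshold=0.9):
    """Group characters below the confidence threshold by class."""
    groups = defaultdict(list)
    for char, confidence, location in chars:
        if confidence < threshold:
            groups[char].append(location)
    return dict(groups)

print(character_session(chars))
# {'e': ['p1:w3:c2', 'p2:w7:c1'], 'c': ['p1:w9:c4']}
```

Reviewing a whole screen of, say, doubtful “e”s in one pass is what makes this kind of collaborative correction so much faster than reading page by page.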
Richard Boulderstone, Director of eStrategy and Programs at the British Library, kicked off the IMPACT Conference this morning with a suitably impactful statement of scope: the British Library, he estimates, has nearly 5 billion physical pages in a 150 million object collection.