Katrien Depuydt provided a brief overview of the IMPACT project’s work packages devoted to creating language tools and lexica to aid both information retrieval (IR) and OCR processing. How might one measure a successful improvement in access to text? She cleverly posits that the key lies in asking ourselves: “Can we handle the ‘world’?”
In an 18th-century Dutch periodical, ‘werried’ was the spelling of the day. With OCR built on a simple Dutch dictionary, you would need to search for that exact spelling, and the results would necessarily be limited. What we really want, she notes, is to key in the modern term “world” and retrieve all the appropriate variants in the text.
This is where IMPACT’s work in building lexica comes in, and we start to discover that yes, we CAN handle ‘the world’. In the course of the project, an OCR lexicon, an IR lexicon and an NE (named entity) lexicon were created for nine languages; these plug into ABBYY FineReader, enhancing both the OCR and the retrieval. This was no simple task: the work required analysing the language resources available for each language and identifying existing tools, special character sets and the like. She gives the example of Bulgarian, for which no dictionaries or lexica existed and some characters were not recognised by ABBYY FineReader, creating a unique set of challenges. How these challenges were overcome will be explored in more detail in the forthcoming Parallel Session 2: Language Session later this afternoon.
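To make the idea of keying in a modern term and retrieving its historical variants more concrete, here is a minimal Python sketch of lexicon-based query expansion. It is an illustration only, not IMPACT’s actual implementation: the names `IR_LEXICON`, `expand_query` and `search` are hypothetical, the sample sentences are invented, and the only variant taken from the talk is ‘werried’.

```python
import re

# Hypothetical toy IR lexicon: modern term -> historical spellings.
# Only 'werried' comes from the talk; a real lexicon would hold many more entries.
IR_LEXICON = {
    "world": ["werried"],
}


def expand_query(term, lexicon):
    """Return the lower-cased search term followed by any variants listed for it."""
    key = term.lower()
    return [key] + lexicon.get(key, [])


def search(term, documents, lexicon):
    """Return the indices of documents containing the term or one of its variants."""
    variants = expand_query(term, lexicon)
    # Match any variant as a whole word, ignoring case.
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(v) for v in variants) + r")\b",
        re.IGNORECASE,
    )
    return [i for i, text in enumerate(documents) if pattern.search(text)]


# Invented sample texts: the first uses the historical spelling, the second has no match.
documents = [
    "De werried is groot.",
    "Een tekst zonder het woord.",
]

print(search("world", documents, IR_LEXICON))  # prints [0]
```

The user types only the modern term; the lexicon supplies the historical spellings behind the scenes, so the search does not depend on the reader knowing how the word was written at the time.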
View the presentation here:
and the video here: