IMPACT pilot at the KB – Conclusion

Lotte | News

In the final months of the IMPACT project in 2012, the KB worked together with one of its current digitisation projects, Early Dutch Books Online (EDBO), to research various methods of improving OCR. EDBO is a joint effort of the KB and the university libraries of Leiden and Amsterdam to digitise 2 million pages of books from 1780 to 1800. After the digitisation process, the project decided to hire a number of students to manually correct certain books. This provided the ideal opportunity for IMPACT to tag along and have some of these students work with a number of IMPACT tools.

Having done this pilot, we learned a lot about the tools: what they needed as input, what they provided as output and how they should be used. The results gave us an idea of what we could achieve with each tool, but we all knew that we could only use them as an indication. The differences between the tools and their methods were too great to base a decision on the results alone.

Goals

However, what we can say is this: when you want to improve your OCR, it is very important to have a clear goal in mind. You should ask yourself at least the following questions: How much better should the OCR be? How much money would we like to spend? How much effort can we spare, and from whom? Is improving the OCR the only goal, or do we also have others in mind, such as crowdsourcing? The answers to these questions can lead to very different ideas about the OCR improvement project and, consequently, about the best tool for you.
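To make the question "How much better should the OCR be?" concrete, it helps to agree on a measure first. The sketch below is only an illustration and not the method used in the pilot: it computes a simple character accuracy by comparing OCR output with a manually corrected version. The function names and the sample strings are our own assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy(ocr_text: str, ground_truth: str) -> float:
    """Share of ground-truth characters that the OCR got right (0.0 to 1.0)."""
    if not ground_truth:
        return 1.0 if not ocr_text else 0.0
    errors = edit_distance(ocr_text, ground_truth)
    return max(0.0, 1.0 - errors / len(ground_truth))


# Hypothetical example: one wrong character out of four.
print(character_accuracy("abxd", "abcd"))
```

Measuring the same sample pages before and after correction with a measure like this gives a rough idea of how close you are to the goal you set.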

When to use which tool?

To make it easier, we've divided our tested tools into three categories: Basic tools, Advanced tools and Re-OCRing.

Basic tools: Alto Edit; PlaIR platform

Advanced tools: LMU Profiler and Post correction tool; CONCERT (not possible to test in this pilot, because of its setup)

Re-OCRing: ABBYY FRE 10 with a historical Dutch dictionary; Adaptive OCR

Basic tools

The Basic tools are the easiest to use. They require (almost) no training and are web-based, which makes them well suited to involving the crowd or other volunteers. You would need back-up from within the library, and your infrastructure should be able to handle a continuous stream of improved data. With many people working on the material, however, it would be possible to reach a very high OCR accuracy.

Advanced tools

The Advanced tools require more training, and it is conceivable that only library staff would use them. However, they do provide more functionality and a higher correction speed than the Basic tools, because of the batch corrections (LMU) and carpet sessions (CONCERT). Both tools can reach a very high accuracy when used to their fullest, but that would require some time.
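To illustrate why batch corrections speed things up, here is a minimal sketch of the general idea: one decision by a corrector is applied to every occurrence of the same misrecognised word across many pages. This is not how the LMU tool is implemented; the function name `batch_correct` and the sample pages are hypothetical.

```python
from __future__ import annotations

import re


def batch_correct(pages: list[str], wrong: str, right: str) -> tuple[list[str], int]:
    """Replace every whole-word occurrence of `wrong` with `right` on all pages.

    Returns the corrected pages and the total number of replacements made.
    """
    # \b keeps the match to whole words, so longer words that merely
    # contain `wrong` are left untouched.
    pattern = re.compile(r"\b" + re.escape(wrong) + r"\b")
    corrected = []
    total = 0
    for page in pages:
        # A function is used as replacement so `right` is inserted literally.
        new_page, count = pattern.subn(lambda match: right, page)
        corrected.append(new_page)
        total += count
    return corrected, total


# Hypothetical OCR output in which a long s was read as "f".
pages = ["de menfch en de menfchen", "een menfch"]
fixed, count = batch_correct(pages, "menfch", "mensch")
print(fixed, count)
```

A single correction here fixes two occurrences at once, while the longer word "menfchen" is left for a separate decision. That is the trade-off with batch correction: it is faster, but each batch rule needs a careful check.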

Re-OCRing

Re-OCRing would be a very good option when you want to spend very little (manual) effort and have some money to spend on licenses. It would also be static, which would be an advantage for some library infrastructures, and it would improve the OCR quite a bit, especially when you also plug in a historical dictionary. IMPACT has produced such dictionaries in nine languages.

Finally

This pilot was done with KB people and KB material, with the KB infrastructure in mind, so when you (or your library) think about OCR correction, please do a pilot of your own. We've learned a great deal about what we think is important for us, what is possible and what material and tools would be our best fit, but that might be very different for each library. We would of course be happy to help and share our experiences via the Centre of Competence.

Read more about our set-up of the pilot in this blog post: IMPACT pilot at the KB – Introduction and set-up or read the full pilot report below.

Related Files