In the final months of the IMPACT project in 2012, the KB worked together with one of its current digitisation projects, Early Dutch Books Online (EDBO), to research various methods of improving OCR. EDBO is a combined effort between the KB and the university libraries of Leiden and Amsterdam to digitise 2 million pages of books from 1780 to 1800. After the digitisation process, it was decided to hire a number of students to manually correct certain books. This provided the ideal opportunity for IMPACT to tag along and have some of these students work with a number of IMPACT tools.
The decision to ask some of the students to work on the pilot was easily made, but which tools did we want to include? On the one hand there are tools that require manual input, and on the other hand there are automated processes, i.e. running OCR again. Both are valid options for improving OCR, so we chose to include both methods in the pilot. In the end we decided to work with:
- CIS Postcorrection system with error profiling
- Re-OCRing with ABBYY in combination with the Dutch historical dictionary
- Alto Edit, a tool developed at the KB
- PlaIR platform from the University of Rouen (an improved version of the Trove newspaper tool)
We decided to include two tools from outside the IMPACT project, namely Alto Edit and the PlaIR platform. Alto Edit, developed at the KB, is the tool that all students would be working with when correcting the OCR. It is a very simple side-by-side correction tool with the option to also change some of the segmentation. The PlaIR platform was built by the University of Rouen and was originally meant for newspapers. It builds upon the Trove tool developed by the National Library of Australia, and the people of Rouen have been very helpful in making it available to us; they even made some changes to make working with books easier.
Setting up a pilot to ensure all outcomes are comparable is hard work! The tools are not the same, the people are not the same and even though we used stopwatches, the time isn't the same. To make sure the results we got were as similar as possible we set up some guidelines and rules.
- Each test was divided into two steps: a training and try-out session of 2.5 hours, followed the next day by the actual test of 3 hours, with a 15-minute break.
- The sets used to train the testers did not contain the same material as the actual test, but did come from the EDBO collection.
- We divided the books into 6 sets, each a cross-section of a book (e.g. set 1-1 contained pages 1, 4, 7, etc. of Book 1), and each tester had a different set per tool.
- The testers used the same guidelines to correct the material (mainly to type what the text said and to separate any ligatures).
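The cross-section split described above amounts to dealing a book's pages round-robin into a number of sets. A minimal sketch in Python, assuming three sets per book with a stride of 3 (the function name and parameters are illustrative; the pilot's actual tooling is not described at this level of detail):

```python
def cross_section_sets(pages, n_sets=3):
    """Deal `pages` into `n_sets` interleaved sets (round-robin),
    so set 1 gets pages 1, 4, 7, ..., set 2 gets pages 2, 5, 8, ..., etc.
    This mirrors the pilot's cross-section split; names are hypothetical."""
    return [pages[i::n_sets] for i in range(n_sets)]

# Example: a 12-page book divided into 3 interleaved sets.
book1_pages = list(range(1, 13))
sets = cross_section_sets(book1_pages, n_sets=3)
print(sets)  # [[1, 4, 7, 10], [2, 5, 8, 11], [3, 6, 9, 12]]
```

Because each set samples evenly across the whole book, every tester sees a comparable mix of material rather than one contiguous (and possibly easier or harder) stretch of pages.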
Unfortunately, even with all our efforts, the pilot was not watertight. LMU, for instance, could only use ABBYY XML files, while the KB had ALTO files. The newly generated ABBYY XML files for LMU differed in quality from the KB's ALTO files by a couple of percent, so the end result would have differed significantly if we had been able to use our own files. The testers might have gotten bored with the material and the pilot after a couple of sessions (unimaginable…) or someone found a job and had to leave (which happened in the PlaIR test). Most of the tools could also fix segmentation errors, which meant higher accuracy but also made them more time-consuming. These were all issues that we could not account for in the pilot, so we decided to go ahead and run the test with this in mind.
Even with everything in mind, people dropping out and some technical issues that led to a couple of days of copying and pasting into text files, we got some very interesting results!
This comparison combines tools that include manual correction of the text with tools that process the material automatically, without any human intervention. Please see the Conclusion for a more elaborate explanation of the differences between the various methods. Also, the results from PlaIR for Book 2 are based on only two of the three sets, as one of the students could not join the final test. Nevertheless, this gives a nice view of the results the KB could possibly obtain with the various methods when working with the EDBO collection.
What we found very interesting is the small difference between LMU and PlaIR for Book 1, with LMU using batch corrections and PlaIR being a side-by-side method. This difference could of course be attributed to the different files we had to use for LMU. Another interesting observation is the good results we get when using ABBYY in combination with the Dutch historical dictionary! When we showed this to the producers of the dictionary, they said we could get even better results with some adaptations, which they included in an updated version of the plug-in for ABBYY. Unfortunately, that update came too late for our pilot, but if you'd like to know more, they're happy to explain how and to show off their results.