The dataset produced by IMPACT is a landmark and an invaluable resource for OCR and language technology for historical documents. It comprises over half a million images from the European libraries participating in IMPACT and more than 50,000 ground truth files, an unprecedented number. These files are highly detailed, with full Unicode-encoded text (including ligatures and special characters) and complete layout information (segmentation, region metadata and reading order). The dataset will foster further research and development for years to come.
One of the IMPACT partners that worked on the dataset was the Poznań Supercomputing and Networking Center (PSNC) from Poland. The work of its Digital Libraries Team resulted in a set of full text versions of selected Polish historical documents from four digital libraries in Poland (source institution first, digital library in parentheses):
- Elbląska Library (Elbląska Digital Library)
- The Kórnik Library of the Polish Academy of Sciences (Digital Library of Wielkopolska)
- Poznań University Library (Digital Library of Wielkopolska)
- The Institute of Journalism, University of Warsaw (Digital Library of Polish and Poland-Related News Pamphlets)
- Wrocław University of Environmental and Life Sciences (Dolnośląska Biblioteka Cyfrowa)
Through the link below you can find a list of the documents with the corresponding source data (master files) and ground truth data (full text versions). The full text versions are encoded in the PAGE XML format (Page Analysis and Ground-truth Elements) developed by the University of Salford. The description of this format can be found here. The full text documents have an accuracy of around 99.95%. There are two versions of the full text:
- Full text versions on region level (paragraph), where accuracy is around 99.95%.
- Full text versions on region level (paragraph), with additional information about coordinates for lines, words and characters. This additional information comes from the optical character recognition process and was not corrected or changed in any way. It can still be helpful when approximate coordinates of a word or character on the image are needed. The accuracy at region level is still around 99.95%.
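As an illustration of how the two versions differ in practice, the following Python sketch reads the corrected region-level text and, where present, the uncorrected word-level text from a PAGE XML document. It is a minimal sketch under stated assumptions, not a description of the PSNC tooling: the embedded sample document, its namespace URI and the helper names (`local_name`, `direct_text`, `read_regions`) are made up for illustration, and the element names (`TextRegion`, `TextLine`, `Word`, `TextEquiv`, `Unicode`) follow the general PAGE layout rather than any specific file in this dataset. Tags are matched without their namespace so the code does not depend on a particular PAGE schema version.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written sample in the general shape of a PAGE XML file.
# The namespace URI is an assumption; real files may declare a different version.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2010-03-19">
  <Page imageFilename="example.tif" imageWidth="100" imageHeight="100">
    <TextRegion id="r1">
      <TextLine id="l1">
        <Word id="w1"><TextEquiv><Unicode>Dawno</Unicode></TextEquiv></Word>
        <Word id="w2"><TextEquiv><Unicode>temu</Unicode></TextEquiv></Word>
        <TextEquiv><Unicode>Dawno temu</Unicode></TextEquiv>
      </TextLine>
      <TextEquiv><Unicode>Dawno temu</Unicode></TextEquiv>
    </TextRegion>
  </Page>
</PcGts>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def direct_text(element):
    """Return the text of the element's own TextEquiv/Unicode child, or None.

    Only direct children are inspected, so a region's text is not confused
    with the text of the lines or words nested inside it.
    """
    for child in element:
        if local_name(child.tag) == "TextEquiv":
            for sub in child:
                if local_name(sub.tag) == "Unicode":
                    return sub.text or ""
    return None


def read_regions(xml_string):
    """Collect region-level text plus any word-level text for each TextRegion."""
    root = ET.fromstring(xml_string)
    regions = []
    for element in root.iter():
        if local_name(element.tag) != "TextRegion":
            continue
        words = [
            direct_text(word)
            for word in element.iter()
            if local_name(word.tag) == "Word"
        ]
        regions.append(
            {"id": element.get("id"), "text": direct_text(element), "words": words}
        )
    return regions


for region in read_regions(SAMPLE):
    print(region["id"], region["text"], region["words"])
```

For a file from the region-level-only version, the `words` list would simply be empty, while `text` carries the corrected paragraph text. In the extended version, word-level entries come from the OCR output and should be treated as approximate.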
Altogether, 4,693 files were processed; the corresponding full text versions contain 6,890,677 characters. The master files total around 16.5 GB. The full text is around 300 MB, and the full text with additional information is around 700 MB. The master files are available thanks to the respective libraries.
All materials are available under the Creative Commons Attribution 3.0 Unported License.