Polish IMPACT ground truth added to Poliqarp search engine

Impact CoC News

Polish ground truth texts in the form of a corpus, developed within the IMPACT project, have now been added to the University of Warsaw’s Poliqarp search engine at http://poliqarp.wbl.klf.uw.edu.pl.

The Poliqarp search engine, made available by the Formal Linguistics Department of the University of Warsaw, facilitates searching digitized texts in the DjVu format. The engine is a modification of the Poliqarp system (developed at the Institute of Computer Science of the Polish Academy of Sciences) used to support the National Corpus of Polish, so it has the same query syntax.

The search engine provides access to two versions of the IMPACT Polish GT corpus, the so-called one-dimensional and two-dimensional versions. In the one-dimensional version, words hyphenated across line breaks have been automatically reconstructed, while in the two-dimensional version the hyphenation has been preserved in its original form.
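The difference between the two versions can be illustrated with a minimal Python sketch. This is not the procedure actually used to build the corpus, which is not described here; the function names `join_hyphenated` and `keep_layout` are hypothetical, and the rule "a trailing hyphen at the end of a line marks a broken word" is a simplifying assumption that would wrongly merge genuine hyphenated compounds.

```python
def join_hyphenated(lines):
    """One-dimensional view (illustrative): merge words split across line ends.

    Assumption: any token ending in '-' at the end of a line is the first
    half of a word continued at the start of the next non-empty line.
    """
    words = []
    carry = ""  # first half of a word broken at the end of an earlier line
    for line in lines:
        tokens = line.split()
        if carry and tokens:
            tokens[0] = carry + tokens[0]
            carry = ""
        if tokens and len(tokens[-1]) > 1 and tokens[-1].endswith("-"):
            carry = tokens.pop()[:-1]
        words.extend(tokens)
    if carry:
        # Nothing followed the hyphen, so keep the fragment unchanged.
        words.append(carry + "-")
    return words


def keep_layout(lines):
    """Two-dimensional view (illustrative): keep line structure and hyphens."""
    return [line.split() for line in lines]


sample = ["To jest przy-", "kład tekstu"]
one_dimensional = join_hyphenated(sample)
two_dimensional = keep_layout(sample)
```

With the sample above, the one-dimensional result contains the single word "przykład", while the two-dimensional result still holds "przy-" at the end of the first line and "kład" at the start of the second.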

The corpus uses data made available by the PSNC Digital Libraries Team, namely a set of full-text versions of selected Polish historical documents from four digital libraries in Poland. The texts were prepared within the framework of the IMPACT project and used as so-called ground truth for the evaluation and training of OCR programs.

The corpus was created by Krzysztof Szafran using various free tools.

Some additional information about the corpus is available in Janusz S. Bien’s notes “Changes to the IMPACT project Polish Ground-Truth texts” and “Delivering the IMPACT project Polish Ground-Truth texts with Poliqarp for DjVu”.

The corpus consists of ca. 1.6 million segments. This version of the corpus has been available since April 20, 2012.