Advanced usage

Working with multiple input files

An accurate evaluation of OCR calls for a representative sample of pages (one that shows, for example, the different page layouts found in different sections of a book). To facilitate the global evaluation of multi-page collections, the ocrevaluation tool also accepts folders as input and compares the pairs of files found inside them. The procedure is the same as for single files (drag and drop or select a folder). When folders are used, note that:

  • the number of files in both folders must be identical;
  • the files to be compared should have names that include the same identifier (for example, page22_gt.txt and page22_ocr.xml); within a folder, all files can share a common prefix and suffix.
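The two requirements above can be checked before a pair of folders is handed to the tool. The following is a minimal sketch, not the tool's own matching logic: it assumes that the shared identifier is the part of the file name before the first underscore, and the helper names identifier and pair_files are hypothetical.

```python
from pathlib import Path
from typing import List, Tuple


def identifier(path: Path) -> str:
    # Assumption for this sketch: the shared identifier is the part of the
    # file name before the first underscore ("page22_gt.txt" -> "page22").
    return path.name.split("_", 1)[0]


def pair_files(gt_dir: Path, ocr_dir: Path) -> List[Tuple[Path, Path]]:
    """Pair ground-truth and OCR files that share an identifier."""
    gt_files = [p for p in gt_dir.iterdir() if p.is_file()]
    ocr_files = [p for p in ocr_dir.iterdir() if p.is_file()]

    # Requirement 1: both folders contain the same number of files.
    if len(gt_files) != len(ocr_files):
        raise ValueError(
            f"{len(gt_files)} ground-truth files but {len(ocr_files)} OCR files"
        )

    gt_by_id = {identifier(p): p for p in gt_files}
    ocr_by_id = {identifier(p): p for p in ocr_files}

    # Identifiers must be unique within each folder, or pairing is ambiguous.
    if len(gt_by_id) != len(gt_files) or len(ocr_by_id) != len(ocr_files):
        raise ValueError("duplicate identifiers inside a folder")

    # Requirement 2: every identifier appears in both folders.
    unmatched = sorted(set(gt_by_id) ^ set(ocr_by_id))
    if unmatched:
        raise ValueError(f"files without a counterpart: {unmatched}")

    return [(gt_by_id[key], ocr_by_id[key]) for key in sorted(gt_by_id)]
```

Calling pair_files(Path("gt"), Path("ocr")) on two hypothetical folders returns the list of (ground truth, OCR) pairs sorted by identifier, or raises an error that names the files lacking a counterpart.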

Advanced options window

If the checkbox before “Advanced options” is marked (with a mouse click), the input window is expanded as follows:

[Screenshot: input window with the advanced options expanded]

The additional input area allows for the upload of a file containing equivalences between Unicode characters, as described in section 2.4. It is also possible to enable the default equivalences that the Unicode standard defines as “compatibility equivalence”. For example, the single glyph representing the ligature “ﬀ” (U+FB00) and the two-character sequence “ff” will be treated as equivalent if the checkbox “Unicode compatibility of characters” is marked. For additional details, please consult the Unicode standard website (http://unicode.org/reports/tr15/#Canon_Compat_Equivalence).
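The effect of compatibility equivalence can be reproduced outside the tool with Python's standard unicodedata module. This is only an illustration of the Unicode concept: the helper compatibility_equal is hypothetical, and comparing the NFKC normal forms is an assumption of this sketch rather than a description of how the tool implements the option.

```python
import unicodedata

LIGATURE = "\ufb00"  # LATIN SMALL LIGATURE FF, a single code point
SEQUENCE = "ff"      # two separate LATIN SMALL LETTER F characters


def compatibility_equal(a: str, b: str) -> bool:
    # Two strings are compatibility-equivalent when their compatibility
    # normal forms (NFKC here) are identical.
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)


# As raw code point sequences, the ligature and the letter pair differ.
plain_match = LIGATURE == SEQUENCE

# After compatibility normalization, they compare as equal.
compat_match = compatibility_equal(LIGATURE, SEQUENCE)
```

Here plain_match is False and compat_match is True, which mirrors the difference between leaving the checkbox unmarked and marking it.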

Command line usage

The tool can also be run from the command line with the following syntax:

java -jar ocrevaluation.jar -gt {ground_truth_file} [encoding] -ocr {ocr_file} [encoding] -d {output_directory} [-r {equivalences_file}] [-c]

where:

  • {ground_truth_file} is the full path to a ground truth file.
  • {ocr_file} is the full path to an OCR result file.
  • {output_directory} is the folder where the report (HTML format) will be generated.
  • [encoding] is the name of the encoding used for the preceding file (optional).
  • {equivalences_file} is an optional text file describing equivalences between Unicode characters (each line contains two sequences of hexadecimal code points, separated by a comma).
  • the option -c activates the compatibility of Unicode characters.
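When the tool is called from a script, the argument list can be assembled programmatically so that the optional parts appear only when needed. The sketch below follows the syntax given above; the function build_command, its parameter names, and the file names in the usage note are hypothetical, and the jar is assumed to be in the working directory.

```python
from typing import List, Optional


def build_command(
    gt_file: str,
    ocr_file: str,
    output_dir: str,
    gt_encoding: Optional[str] = None,
    ocr_encoding: Optional[str] = None,
    equivalences_file: Optional[str] = None,
    compatibility: bool = False,
    jar: str = "ocrevaluation.jar",  # assumed location of the jar
) -> List[str]:
    """Assemble the argument list for one ground-truth/OCR file pair."""
    cmd = ["java", "-jar", jar, "-gt", gt_file]
    if gt_encoding:
        cmd.append(gt_encoding)  # optional encoding follows its file
    cmd += ["-ocr", ocr_file]
    if ocr_encoding:
        cmd.append(ocr_encoding)
    cmd += ["-d", output_dir]
    if equivalences_file:
        cmd += ["-r", equivalences_file]
    if compatibility:
        cmd.append("-c")  # enable Unicode compatibility equivalence
    return cmd
```

For example, build_command("page22_gt.txt", "page22_ocr.xml", "report", gt_encoding="UTF-8", compatibility=True) produces a list that can be passed to subprocess.run(cmd, check=True), assuming java is available on the PATH.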