Additional notes

Working with multiple input files

An accurate evaluation of OCR requires a representative sample of pages (covering, for example, the different page layouts found in the various sections of a book). To facilitate the global evaluation of multi-page collections, the OCRevalUAtion tool also accepts folders as input and compares the pairs of files found inside them. The procedure is identical to that for single files (drag and drop, or select a folder). When using folders, note that:

  • the number of files in both folders must be identical;
  • the files to be compared should have names that include the same identifier (for example, page22_gt.txt and page22_ocr.xml; all files in the same folder can share a common prefix and suffix).
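
The two requirements above can be checked before running the tool. The following is a minimal sketch, not the tool's own matching logic (which is not described here). It assumes that the shared identifier is the part of the file name before the first underscore or dot, as in page22_gt.txt; the function names page_key and pair_files are hypothetical.

```python
import re


def page_key(filename):
    """Return the identifier shared by a ground-truth/OCR pair.

    Assumption: the identifier is the part of the file name before the
    first underscore or dot, e.g. "page22" in "page22_gt.txt".
    """
    return re.split(r"[_.]", filename, maxsplit=1)[0]


def pair_files(gt_names, ocr_names):
    """Pair file names from two folders by their shared identifier."""
    # Requirement 1: both folders contain the same number of files.
    if len(gt_names) != len(ocr_names):
        raise ValueError("both folders must contain the same number of files")

    # Index the OCR files by identifier for direct lookup.
    ocr_by_key = {page_key(name): name for name in ocr_names}

    pairs = []
    for gt_name in sorted(gt_names):
        key = page_key(gt_name)
        # Requirement 2: each ground-truth file has an OCR counterpart.
        if key not in ocr_by_key:
            raise ValueError(f"no OCR file matches identifier {key!r}")
        pairs.append((gt_name, ocr_by_key[key]))
    return pairs


if __name__ == "__main__":
    for gt_name, ocr_name in pair_files(
        ["page22_gt.txt", "page3_gt.txt"],
        ["page3_ocr.xml", "page22_ocr.xml"],
    ):
        print(gt_name, "<->", ocr_name)
```

A mismatch in either requirement raises a ValueError that names the problem, so naming errors can be corrected before the folders are given to the tool.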

Command line usage

The tool can also be run from the command line with the following syntax:


java -jar ocrevaluation.jar -gt {ground_truth_file} [encoding] -ocr {ocr_file} [encoding] -d {output_directory} [-r {equivalences_file}]

where:

  • {ground_truth_file} is the full path to a ground truth file. Supported formats: Text, PAGE.
  • {ocr_file} is the full path to an OCR result file. S
  • {output_directory} is the folder where the report (HTML format) will be generated.
  • {encoding} is the character encoding of the file that precedes it (optional).
  • {equivalences_file} is an optional text file describing equivalences between Unicode characters (each line contains two sequences of hexadecimal code points, separated by a comma).
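
When the tool is called from a script, the argument list can be assembled programmatically. The sketch below follows the syntax given above and uses only the options listed there. The function name build_command, the example paths and the encoding value "utf-8" are assumptions made for illustration; the encoding names actually accepted by the tool are not stated here.

```python
import shlex


def build_command(gt_file, ocr_file, output_dir,
                  gt_encoding=None, ocr_encoding=None,
                  equivalences_file=None, jar="ocrevaluation.jar"):
    """Assemble the argument list for one command-line run of the tool."""
    cmd = ["java", "-jar", jar, "-gt", gt_file]
    if gt_encoding:
        # The optional encoding directly follows the file it describes.
        cmd.append(gt_encoding)
    cmd += ["-ocr", ocr_file]
    if ocr_encoding:
        cmd.append(ocr_encoding)
    cmd += ["-d", output_dir]
    if equivalences_file:
        cmd += ["-r", equivalences_file]
    return cmd


if __name__ == "__main__":
    # Hypothetical paths and encoding value, for illustration only.
    cmd = build_command(
        "/data/gt/page22_gt.txt",
        "/data/ocr/page22_ocr.xml",
        "/data/report",
        gt_encoding="utf-8",
    )
    # Print the command in a form that can be pasted into a shell.
    print(shlex.join(cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True) to start the evaluation. Keeping the arguments as a list avoids quoting problems with paths that contain spaces.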