Unicode

Unicode is a standard for representing text in any writing system. Essentially, it maps every character to a unique integer (its code). These numbers can easily be stored in digital devices such as computers. Since the standard maps more than 100,000 characters, the codes are stored in compact forms such as UTF-8 or UTF-16. UTF-8 encoding is therefore just a compact way to store character codes in a digital file.
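As a small illustration of the difference between a character code and its stored form, the following Python sketch prints the code of one character and its UTF-8 and UTF-16 byte sequences. The helper name describe and the choice of the long s (ſ) as the example character are assumptions made for this illustration only.

```python
def describe(ch):
    """Return the Unicode code of a single character and its encoded bytes."""
    code = ord(ch)  # the integer that Unicode assigns to the character
    return {
        "code": code,
        "hex": f"{code:04X}",            # hexadecimal notation, at least 4 digits
        "utf8": ch.encode("utf-8"),      # compact byte form used in most text files
        "utf16": ch.encode("utf-16-be"), # big-endian UTF-16, no byte order mark
    }


info = describe("\u017f")  # LATIN SMALL LETTER LONG S (ſ)
print(info["code"], info["hex"])  # 383 017F
print(info["utf8"])               # b'\xc5\xbf'
print(info["utf16"])              # b'\x01\x7f'
```

The same code (383) is stored as two different byte sequences depending on the encoding, which is why a file's encoding must be known before its characters can be compared.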

The Unicode standard provides good coverage of modern scripts but often lacks support for old glyphs (a glyph is a possible representation of a character or a sequence of characters). For example, Unicode assigns the number 10016 (or 2720 in hexadecimal) to the Maltese cross (✠) but has no code for the triple ligature of long s + long s + i (ſſi).

Some consortia, such as the Medieval Unicode Font Initiative, use the so-called Private Use Areas (PUA) to represent characters or glyphs. The PUA are ranges of codes which will remain officially unassigned and whose meaning can be defined by users for a particular context.
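A quick way to tell whether a code falls in a Private Use Area is to test it against the three ranges that the Unicode standard reserves for private use. The sketch below is only illustrative; the names PUA_RANGES and is_private_use are hypothetical and not part of any tool described here.

```python
# The three Private Use ranges defined by the Unicode standard (inclusive bounds).
PUA_RANGES = [
    (0xE000, 0xF8FF),      # Private Use Area in the Basic Multilingual Plane
    (0xF0000, 0xFFFFD),    # Supplementary Private Use Area-A
    (0x100000, 0x10FFFD),  # Supplementary Private Use Area-B
]


def is_private_use(ch):
    """Return True if the single character ch has a code inside a PUA range."""
    code = ord(ch)
    return any(low <= code <= high for low, high in PUA_RANGES)


print(is_private_use(chr(0xF50D)))  # True: a code in the BMP Private Use Area
print(is_private_use(chr(0xFB00)))  # False: an officially assigned ligature
print(is_private_use("q"))          # False
```

A text that contains PUA codes can only be interpreted correctly by software and fonts that share the same private convention.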

Since producing the ground truth may in some cases require encoding specific glyphs, it is necessary to define when a certain output of the OCR engine can be accepted as equivalent to the ground-truth character. For example, a q with tilde (q̃) can be considered equivalent to a q in some contexts but equivalent to the word “que” in historical Spanish.

The OCR evaluation tool described here allows the user to define equivalences between Unicode character sequences. If the difference between the ground-truth text and the OCR output can be understood as the substitution of a character or sequence by another equivalent character or sequence, this difference will not be considered a mistake and will not contribute to the error rates.
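The idea can be sketched in Python as follows. This is a simplified illustration and not the tool's actual algorithm: it assumes that each equivalence is a pair of strings and that rewriting the first member of every pair into the second before comparing is an acceptable approximation. The names canonicalize and are_equivalent are hypothetical.

```python
def canonicalize(text, equivalences):
    """Rewrite every left-hand sequence in text as its right-hand equivalent."""
    # Longer sequences are replaced first so that a short sequence cannot
    # break up a longer one that contains it.
    ordered = sorted(equivalences, key=lambda pair: len(pair[0]), reverse=True)
    for left, right in ordered:
        text = text.replace(left, right)
    return text


def are_equivalent(ground_truth, ocr_output, equivalences):
    """Compare two texts after applying the same equivalences to both."""
    return canonicalize(ground_truth, equivalences) == canonicalize(ocr_output, equivalences)


# Assumed equivalences: ligature ff = f + f, and q with tilde = "que".
equivalences = [("\ufb00", "ff"), ("q\u0303", "que")]

print(are_equivalent("o\ufb00ice", "office", equivalences))  # True
print(are_equivalent("o\ufb00ice", "office", []))            # False
print(are_equivalent("q\u0303", "que", equivalences))        # True
```

With the equivalences applied, the ligature and its expansion no longer count as a difference, so they would not add to the error rate.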

The equivalences are stored in a text file containing two equivalent sequences per line, separated by a comma (a comment can optionally follow after another comma). Each character is described by its Unicode code in hexadecimal notation (in case of doubt, this is an excellent site to search for Unicode values).

Sample lines of the equivalences file:

FB00, 0066 0066, Latin small ligature ff (ff)

F50D, 0071 0301 A76B, Latin small letter q + combining acute + Latin small letter et (q́ꝫ)

FEFF, 0020, Byte order mark
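A line in this format can be read with a few lines of Python. The sketch below is a minimal illustration based only on the format described above; the function name parse_equivalence_line is hypothetical and the tool's own parser may behave differently (for example, in how it treats blank or malformed lines).

```python
def parse_equivalence_line(line):
    """Parse 'HEX [HEX ...], HEX [HEX ...][, comment]' into (left, right, comment)."""
    # Split on the first two commas only, so a comment may itself contain commas.
    fields = [field.strip() for field in line.split(",", 2)]
    if len(fields) < 2 or not fields[0] or not fields[1]:
        raise ValueError(f"expected two comma-separated sequences: {line!r}")

    def decode(sequence):
        # Each space-separated token is one Unicode code in hexadecimal.
        return "".join(chr(int(token, 16)) for token in sequence.split())

    comment = fields[2] if len(fields) > 2 else ""
    return decode(fields[0]), decode(fields[1]), comment


left, right, comment = parse_equivalence_line(
    "FB00, 0066 0066, Latin small ligature ff (ff)"
)
print(repr(left), repr(right))  # '\ufb00' 'ff' (the first is shown as the ligature glyph)
print(comment)                  # Latin small ligature ff (ff)
```

Blank lines between entries, as in the sample above, would need to be skipped before calling the function.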