What’s in a word?

Rafael Carrasco
Discussions, OCR evaluation/quality control

Word error rate is often used as a measure of OCR accuracy. Although words are the relevant unit in information retrieval, defining what a word is in the context of OCR is not as simple as it might appear at first glance. This post compiles some of my ideas on the relation between words and characters.

In linguistics, a word is often defined as the smallest element that may be uttered in isolation. In natural language processing, a more practical definition is often employed: a word is a continuous sequence of letters or digits. This definition, which ignores punctuation, may however be too restrictive. For example, “vice-president” cannot be split into “vice” and “president” without a change in meaning, and “3.1416” is a different concept from “3” followed by “1416” (which could be a year, for example).

We will therefore classify characters into four categories:

  • Letters and numbers, which are the standard components of words (excluding punctuation and other symbols).
  • Letter connectors, or characters that can appear between letters inside a word (but not at its extremities).
  • Number connectors, or non-digit characters that can be a significant component of a number.
  • Stoppers, or characters that cannot be part of a word.

The list of letter connectors is partly language-dependent, but could include:

  • Hyphens, as in “vice-president”.
  • Ampersands, as in “R&D”.
  • Apostrophes, as in “cont’d”.
  • Plus signs, as in “I+D” (Spanish for “R&D”).
  • At signs, as in “AT@T”.
  • Dots, as in “I.B.M”.
  • Middle dots in Catalan, as in “col·lecció”.
  • Slashes in informal Spanish writing, as in “niños/as”.

Number connectors include:

  • Dots, as in “3.1416”.
  • Commas, as in “1,000”.
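
As an illustration, the classification above can be turned into a small tokenizer. The following Python sketch is only one possible reading of it: the names LETTER_CONNECTORS, NUMBER_CONNECTORS and tokenize are hypothetical, the connector sets are simply the examples listed above, and a connector is kept only when it sits between two letters (or between two digits, for number connectors). Every other character acts as a stopper.

```python
import re

# Assumed connector sets, copied from the examples in the lists above.
LETTER_CONNECTORS = "-&'’+@.·/"
NUMBER_CONNECTORS = ".,"

LETTER = r"[^\W\d_]"   # any Unicode letter
DIGIT = r"\d"          # any Unicode digit
ALNUM = r"[^\W_]"      # letter or digit

def build_word_pattern(letter_connectors=LETTER_CONNECTORS,
                       number_connectors=NUMBER_CONNECTORS):
    """Compile a regex matching words that may contain inner connectors."""
    lc = "[" + re.escape(letter_connectors) + "]"
    nc = "[" + re.escape(number_connectors) + "]"
    # A connector is accepted only between two letters or between two digits,
    # so it can never appear at a word extremity.
    connector = (rf"(?:(?<={LETTER}){lc}(?={LETTER})"
                 rf"|(?<={DIGIT}){nc}(?={DIGIT}))")
    return re.compile(rf"{ALNUM}+(?:{connector}{ALNUM}+)*")

WORD_PATTERN = build_word_pattern()

def tokenize(text):
    """Return the list of words in text; stoppers are discarded."""
    return WORD_PATTERN.findall(text)

print(tokenize("The vice-president spent 1,000 euros on R&D."))
# → ['The', 'vice-president', 'spent', '1,000', 'euros', 'on', 'R&D']
```

Note that the final dot in the example is dropped because it is not followed by a letter. A sketch like this will still join words when a space is missing after a full stop (“end.Start”), which is one reason why the connector list needs to be chosen per language and per collection.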

Certain characters, like the percent sign “%” and the ordinal indicators “º” and “ª”, are very significant components of a number. They can either be considered an intrinsic part of its meaning or be seen as a modifier (and, thus, a separate word).
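
The two options can be compared with a short Python sketch. The function name tokenize_numbers, the attach_modifiers flag and the NUMBER_MODIFIERS set are assumptions made for this illustration, and the sketch extracts only numbers and their modifiers, ignoring all other text.

```python
import re

# Assumed set of number modifiers, taken from the examples in the text.
NUMBER_MODIFIERS = "%ºª"

def tokenize_numbers(text, attach_modifiers=True):
    """Extract numbers from text, with modifiers attached or as separate words."""
    number = r"\d+(?:[.,]\d+)*"   # digits joined by dots or commas
    modifier = "[" + re.escape(NUMBER_MODIFIERS) + "]"
    if attach_modifiers:
        # The modifier is treated as an intrinsic part of the number.
        return re.findall(rf"{number}{modifier}?", text)
    # The modifier becomes a separate word, but only right after a digit.
    return re.findall(rf"{number}|(?<=\d){modifier}", text)

print(tokenize_numbers("25% of 1,000"))
# → ['25%', '1,000']
print(tokenize_numbers("25% of 1,000", attach_modifiers=False))
# → ['25', '%', '1,000']
```

The choice matters for word error rate: with attached modifiers, a misrecognised “%” spoils the whole word “25%”, whereas with separate modifiers it counts as one wrong word next to a correct “25”.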