Digital formats and standards

Digital content can be stored in different formats because a digital object is just a long sequence of binary digits (zeros and ones) held in a computer, and that sequence must be interpreted before it acquires meaning. The interpretation is not predefined: depending on the objective and the available technology, different formats have been defined to store and interpret the information contained in digital objects. Some very common formats include:

  • TIFF, JPG, PNG, and PDF for images.
  • TXT, RTF, XML, PDF, and DOC for text documents (plain or rich text).
  • ASCII, ISO 8859, Unicode, UTF-8, and CP-1252 for plain text (character encodings).
  • Dublin Core, EDM, MARC, FRBR for descriptive metadata (information about the document itself).
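
To illustrate why a binary sequence needs an interpretation, the following minimal Python sketch decodes the same four bytes under three of the encodings listed above. The sample word and the variable names are chosen here purely for illustration.

```python
# The same four bytes, read under three different character encodings.
raw = bytes([0x63, 0x61, 0x66, 0xE9])  # "café" as stored in ISO 8859-1

as_latin1 = raw.decode("iso-8859-1")             # 0xE9 is read as "é"
as_cp1252 = raw.decode("cp1252")                 # CP-1252 agrees for this byte
as_utf8 = raw.decode("utf-8", errors="replace")  # 0xE9 alone is not valid UTF-8

# The reverse problem: text stored as UTF-8 but read as CP-1252.
utf8_bytes = "café".encode("utf-8")              # "é" takes two bytes in UTF-8
misread = utf8_bytes.decode("cp1252")            # each byte becomes its own character

print(as_latin1, as_cp1252, repr(as_utf8), utf8_bytes, misread)
```

Under ISO 8859-1 and CP-1252 the bytes read as the intended word, while a UTF-8 reader has to substitute a replacement character for the final byte. Nothing in the bytes themselves says which reading is correct, which is why the encoding has to be known or declared alongside the content.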

Fortunately, these popular formats are supported by a wide variety of tools that can be used to access, create, store, or manipulate the content of digital objects. In particular, most OCR engines can process TIFF or JPG files and convert them into readable content, usually in XML or PDF format.
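
Tools usually decide how to interpret a file by inspecting its first few bytes, which most formats reserve for a fixed signature. The sketch below is a simplified illustration of that idea for the image formats mentioned above; the function name and the signature table are hypothetical and far less complete than what real tools use.

```python
# Leading byte signatures ("magic numbers") of some common formats.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"II*\x00", "TIFF"),  # little-endian byte order
    (b"MM\x00*", "TIFF"),  # big-endian byte order
    (b"%PDF-", "PDF"),
]


def sniff_format(header: bytes) -> str:
    """Guess a file format from the first bytes of its content."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unknown"


print(sniff_format(b"%PDF-1.7\n"))       # a PDF header
print(sniff_format(b"plain text file"))  # no known signature
```

In practice the header would be read from disk, for example with open(path, "rb").read(8), before an image is handed to an OCR engine.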