Recommendations on formats and standards useful in digitisation. Recommendations
The aim of Succeed recommendations is to help stakeholders (research groups, companies and cultural heritage organizations) to select a particular format or standard for their digitization-related activities. The recommendations are divided into 3 parts, each focused on a specific aspect of digitization activities:
- Long-term preservation – this part covers formats and standards related to master files, metadata and OCR results.
- Online delivery – this part covers formats and standards related to delivery files, descriptive metadata, OCR results and identifiers.
- Advanced and supporting technologies – this part covers guidelines for semantic technologies, linguistic resources and tools packaging.
The above division is dictated by practical reasons: if a particular institution digitizes for preservation, it should focus on the long-term preservation part. If the institution digitizes for access, the online delivery part should be of interest. If the institution does both (digitization for preservation and access), both the long-term preservation and online delivery parts should be consulted. Finally, if there is a vision of using new and advanced technologies to enhance the digitization workflow, the part on advanced and supporting technologies is relevant.
Each aspect discussed in this section can have several recommended and alternative items (e.g. formats or standards). For instance, the “Master file format – textual documents” part has TEI and PDF/A as recommended formats and UTF-8 encoded plain text as an alternative. This means that TEI and PDF/A are equally applicable and can be selected based on the specific preferences or experience of a particular institution. It also means that UTF-8 encoded plain text has some limitations which make it an alternative rather than the first choice. Nevertheless, if the institution does not have the resources to create PDF/A or TEI documents (e.g. no appropriate software or lack of staff), or has another reason for not using the recommended items (e.g. policy), then the alternative format is an acceptable choice. In the example discussed, UTF-8 encoded plain text is the alternative format and will in most cases require much less effort to create. But even if an institution decides to use an alternative format, it should look for opportunities to move to a recommended one, as that is the most appropriate way of dealing with the particular digitization aspect.
This part of the recommendations covers formats for master files, descriptive metadata, structural metadata, administrative metadata and OCR results. The reason for selecting a particular format as recommended is strongly connected with its sustainability factors, especially disclosure and adoption.
Master file format – still images
Recommended: TIFF
Alternative: JPEG2000 (JP2)
For preservation of still images the recommended format is TIFF. It is the most popular format both in existing recommendations (94% of them indicate TIFF) and in the Succeed survey results (87% of respondents indicated TIFF). The format is well documented and has strong support in software for scanning, OCR, image manipulation and conversion. The recommended characteristics of the TIFF format are presented in Table 1.
Table 1 Summary of recommended characteristics for the TIFF format
Resolution: at least 300 dpi. The final resolution should depend on the document type. The goal is to have all important characteristics of the document clearly visible. Quality Index can be helpful when calculating the final resolution.
Bit depth: 24-bit for colour images, 8-bit for greyscale
Compression: uncompressed or LZW compression
Colour profile: ICC-based (ICC stands for International Color Consortium)
Number of pages: 1 per file (monopage TIFF)
The alternative master file format is JPEG2000 Part 1 (Core), known as JP2. The format is quite popular in existing recommendations (53%), but is much less used in current digitization activities (14% of respondents of the Succeed survey use it for master files). In terms of usage, JPEG2000 appears to be an emerging format rather than a well-established one. The format is well documented, but it is also quite complicated. It can act both as a master file and as a delivery file, which makes it especially interesting for production master files. Unfortunately JPEG2000 does not have wide software support, although there are ongoing activities that develop tools and specifications supporting JPEG2000 in various ways (e.g. OpenJPEG, Jpylyzer, IIIF). Because of these current limitations it has been identified as an alternative format.
Master file format – textual documents
Recommended: TEI, PDF/A
Alternative: UTF-8 encoded plain text
For preservation of documents available in textual form we recommend using TEI or PDF/A.
TEI is focused on the representation of texts, including their structural and conceptual characteristics. The format is very flexible, which can be both an advantage and a disadvantage. Fortunately there are multiple customizations of TEI, including TEI Lite, which contains the elements sufficient for simple documents. TEI Lite is the most widely used TEI customization. TEI is popular in digital humanities, which also indicates that it is a good option for preservation of texts. More information on TEI can be found in the section of this document dedicated to TEI.
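To give an impression of how small a TEI document for a simple text can be, the sketch below builds a minimal TEI skeleton with the Python standard library. It assumes the usual minimal structure (a header with title, publication and source statements, followed by a body of paragraphs); the function name `build_minimal_tei` and the placeholder statements are hypothetical, and the output is not validated against any TEI schema.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)  # serialize TEI as the default namespace


def build_minimal_tei(title, paragraphs):
    """Return a minimal TEI document (as a string) for a sequence of paragraphs."""
    def q(tag):
        return f"{{{TEI_NS}}}{tag}"

    tei = ET.Element(q("TEI"))

    # Header: the three statements usually expected in a file description.
    header = ET.SubElement(tei, q("teiHeader"))
    file_desc = ET.SubElement(header, q("fileDesc"))
    title_stmt = ET.SubElement(file_desc, q("titleStmt"))
    ET.SubElement(title_stmt, q("title")).text = title
    publication = ET.SubElement(file_desc, q("publicationStmt"))
    ET.SubElement(publication, q("p")).text = "Placeholder publication statement"
    source = ET.SubElement(file_desc, q("sourceDesc"))
    ET.SubElement(source, q("p")).text = "Placeholder source description"

    # Text: one <p> element per paragraph.
    text = ET.SubElement(tei, q("text"))
    body = ET.SubElement(text, q("body"))
    for paragraph in paragraphs:
        ET.SubElement(body, q("p")).text = paragraph

    return ET.tostring(tei, encoding="unicode")
```

A real project would normally replace the placeholder statements with actual bibliographic information and validate the result against the chosen TEI customization.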
PDF/A is an ISO standard dedicated to archiving various types of documents in digital form. The format is relatively new and is therefore not widely indicated as a master file format, neither by existing recommendations nor by current practices gathered in the Succeed survey. Nevertheless it is based on PDF, which is very popular and is used for master files by 23% of survey respondents. It is therefore reasonable, at least for those who already use regular PDF, to move to PDF/A. It is very important to distinguish PDF/A from the PDF format. PDF/A is an archival format which is based on PDF, but introduces specific restrictions and requirements to ensure appropriate visual representation of the document and other characteristics. For example, it requires fonts to be embedded in the document, requires ICC-based colour profiles and disallows encryption. There are three consecutive versions of the PDF/A format, each having several conformance levels. The conformance levels include:
- Level B – ensures appropriate visual appearance of the document. This level has been introduced in the PDF/A-1 version.
- Level A – builds on level B, but in addition requires structured information about the document. This level has been introduced in the PDF/A-1 version.
- Level U – ensures that the text in the document can be extracted and appropriately interpreted. This level has been introduced in the PDF/A-2 version.
Consecutive versions of the format also added new capabilities. The most important aspects of each version are:
- PDF/A-1 – introduces restrictions related to fonts, colors, etc.
- PDF/A-2 – introduces the possibility of having different layers in the document, and allows JPEG2000 compression and attachments to the document.
- PDF/A-3 – makes the attachments mechanism more flexible.
None of the versions is obsolete, so all of them can be used for archiving purposes. They simply provide different sets of features and different sets of conformance levels.
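A PDF/A file declares its version and conformance level in its XMP metadata, using the `pdfaid:part` and `pdfaid:conformance` properties. The sketch below reads that declaration from an XMP packet that has already been extracted from the file; the extraction step, the function name `read_pdfa_claim` and the sample packet are assumptions made for illustration. Note that it only reports what the file claims; checking actual conformance requires a dedicated validator.

```python
import xml.etree.ElementTree as ET

# Namespace of the PDF/A identification schema used in XMP metadata.
PDFAID_NS = "http://www.aiim.org/pdfa/ns/id/"


def read_pdfa_claim(xmp_packet):
    """Return (part, conformance) declared in an XMP packet, or None if absent."""
    root = ET.fromstring(xmp_packet)
    part = conformance = None
    for elem in root.iter():
        # The properties may be written as attributes of rdf:Description ...
        part = elem.attrib.get(f"{{{PDFAID_NS}}}part", part)
        conformance = elem.attrib.get(f"{{{PDFAID_NS}}}conformance", conformance)
        # ... or as child elements.
        if elem.tag == f"{{{PDFAID_NS}}}part":
            part = (elem.text or "").strip()
        elif elem.tag == f"{{{PDFAID_NS}}}conformance":
            conformance = (elem.text or "").strip()
    if part is None:
        return None
    return part, conformance


# Hypothetical packet declaring PDF/A-2 with conformance level U.
SAMPLE_XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about="" xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/"
        pdfaid:part="2" pdfaid:conformance="U"/>
  </rdf:RDF>
</x:xmpmeta>"""

print(read_pdfa_claim(SAMPLE_XMP))  # ('2', 'U')
```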
An alternative format for text representation is a Unicode plain text file (encoded with UTF-8). It is only an alternative because it lacks support for structural information: the file simply represents a stream of characters. We recommend using the UTF-8 encoding as it is compatible with ASCII and is able to encode various diacritics. It is also worthwhile to store text files in a Unicode normalization form (such as NFC), so that the same character is always represented by the same sequence of code points.
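The effect of normalization can be illustrated with a short sketch. It assumes NFC as the normalization form (other forms exist and may suit a given project better); the function name `to_normalized_utf8` is hypothetical.

```python
import unicodedata


def to_normalized_utf8(text, form="NFC"):
    """Normalize text to the given Unicode normalization form and encode it as UTF-8."""
    return unicodedata.normalize(form, text).encode("utf-8")


composed = "\u00e9"      # 'é' as a single code point
decomposed = "e\u0301"   # 'e' followed by a combining acute accent

# The two strings look identical on screen but differ as code point sequences ...
assert composed != decomposed
# ... and become byte-identical once normalized before encoding.
assert to_normalized_utf8(composed) == to_normalized_utf8(decomposed) == b"\xc3\xa9"
# ASCII text is unchanged by UTF-8 encoding, which is the compatibility noted above.
assert to_normalized_utf8("plain ASCII") == b"plain ASCII"
```

Without normalization, two files containing visually identical text may fail byte-level comparison, checksum matching or full-text search.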
In the case of historical documents, especially those with special characters not currently available in the Unicode standard, we recommend using the MUFI specification (code points). Such an approach minimizes the risk of code point collisions between textual resources coming from different digitization projects or software tools. MUFI characters are also likely to be incorporated into Unicode itself (e.g. 152 MUFI characters were added in Unicode 5.1). For details on MUFI please see the section of this document dedicated to MUFI.
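When text from several projects is merged, it can help to list the characters that rely on such project-level agreements. The sketch below assumes that characters not yet in Unicode are stored as Private Use Area code points (Unicode general category `Co`), which is our understanding of how MUFI assigns them; the function name `private_use_characters` is hypothetical and the code points in the usage example are arbitrary, not actual MUFI assignments.

```python
import unicodedata


def private_use_characters(text):
    """Return sorted 'U+XXXX' labels of Private Use Area code points found in text."""
    found = {ch for ch in text if unicodedata.category(ch) == "Co"}
    return [f"U+{ord(ch):04X}" for ch in sorted(found)]


# Arbitrary Private Use Area code points mixed with ordinary letters.
print(private_use_characters("ab\ue000c\uf8ff\ue000"))  # ['U+E000', 'U+F8FF']
```

The resulting list can then be compared with the MUFI character recommendation to confirm that every such code point is used with its MUFI meaning.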
Descriptive metadata format
Recommended: DCMES (Dublin Core), MODS
Alternative: MARC21
The most popular descriptive metadata format is Dublin Core (the full name is Dublin Core Metadata Element Set, abbreviated as DCMES), which is a globally recognized ISO standard. 71% of existing recommendations and 59% of survey respondents have indicated it as the main format for descriptive metadata in the context of long-term preservation. It is a simple and easy to use format, commonly expressed in XML. The simplicity of DCMES is an advantage and a disadvantage at the same time. It is an advantage because many institutions can easily use it. It is a disadvantage because the meaning of particular elements in the standard is not strictly defined, which may cause misunderstandings. If a more detailed description is needed, Dublin Core Metadata Initiative Terms (DCTerms) can be used, as they include all the elements from DCMES and add further ones which allow for a more precise description.
The MODS format is quite popular, with relatively high adoption in the user community (16% of respondents use it for preservation, and 47% of existing recommendations indicate it as a good option). MODS is based on XML and can contain a richer description than Dublin Core. It is also derived from MARC21 (though it is not able to carry full MARC21 records) and can therefore be easily created from existing MARC21 records.
MARC21 was also indicated in existing recommendations and in the survey. Nevertheless, it is not highly recommended, as it has several interoperability issues. It has a specific encoding scheme for transport purposes (the MARC21 communication format), but that scheme is not simple, not self-descriptive and definitely not human-readable. An additional complication is that MARC21 records can be encoded using different character encodings. This may cause further issues: for instance, the offsets indicated in the MARC21 leader (header) depend on characters and not bytes, and some characters can occupy more than one byte depending on the encoding. This means that the encoding needs to be known before processing, and it is not available in the file itself. For these reasons the MARC21 format is proposed as an alternative.
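The simplicity of DCMES can be seen in a short sketch that serializes a record as XML. The 15 element names and the namespace URI are those of DCMES; the wrapper element `record`, the function name `build_dc_record` and the sample values in the usage note are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# The 15 elements of the Dublin Core Metadata Element Set.
DCMES_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


def build_dc_record(fields):
    """Build a flat Dublin Core record from (element, value) pairs.

    Elements may repeat; names outside DCMES raise ValueError.
    """
    record = ET.Element("record")  # hypothetical wrapper element
    for element, value in fields:
        if element not in DCMES_ELEMENTS:
            raise ValueError(f"not a DCMES element: {element}")
        ET.SubElement(record, f"{{{DC_NS}}}{element}").text = value
    return ET.tostring(record, encoding="unicode")
```

For example, `build_dc_record([("title", "Atlas"), ("creator", "Jan Kowalski")])` yields a record with one `dc:title` and one `dc:creator` element. Nothing in the format constrains what a value such as the creator name should look like, which is exactly the looseness discussed above.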
Structural metadata format
Recommended: METS
For structural metadata the only option is the METS format; in practice there is no real alternative. It is already used by 36% of survey respondents and is indicated by 59% of existing recommendations. It is an XML-based open standard, simple to apply, and it supports embedding various specific formats, including MODS, ALTO, TextMD, MIX and PREMIS (all of which are recommended by the Succeed project). It is therefore the best option (and in practice the only one) for structural metadata for long-term preservation.
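The role of METS as a structural container can be sketched as follows: a file section lists the master images, and a physical structure map orders them as pages. The element and attribute names are taken from the METS schema, but the sketch is deliberately incomplete (no header or metadata sections) and is not schema-validated; the function name `build_mets`, the `USE="MASTER"` label, the `IMG0001`-style identifiers and the assumed TIFF MIME type are illustrative choices.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def build_mets(page_files):
    """Build a METS skeleton with one file entry and one page div per image."""
    def m(tag):
        return f"{{{METS_NS}}}{tag}"

    mets = ET.Element(m("mets"))
    file_sec = ET.SubElement(mets, m("fileSec"))
    file_grp = ET.SubElement(file_sec, m("fileGrp"), USE="MASTER")
    struct_map = ET.SubElement(mets, m("structMap"), TYPE="PHYSICAL")
    volume = ET.SubElement(struct_map, m("div"), TYPE="volume")

    for order, href in enumerate(page_files, start=1):
        file_id = f"IMG{order:04d}"
        # File section: where the image lives (TIFF masters are assumed here).
        file_el = ET.SubElement(file_grp, m("file"), ID=file_id, MIMETYPE="image/tiff")
        ET.SubElement(file_el, m("FLocat"),
                      {"LOCTYPE": "URL", f"{{{XLINK_NS}}}href": href})
        # Structure map: which page of the volume the image represents.
        page = ET.SubElement(volume, m("div"), TYPE="page", ORDER=str(order))
        ET.SubElement(page, m("fptr"), FILEID=file_id)

    return ET.tostring(mets, encoding="unicode")
```

Descriptive and administrative metadata (for example MODS, MIX or PREMIS records) would be added in their own METS sections and linked to the same file and div identifiers.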
Administrative metadata format
Recommended: PREMIS, MIX, TextMD
In the case of administrative metadata, existing recommendations and survey respondents indicate PREMIS for preservation metadata and MIX or NISO Z39.87 for technical metadata of still images. TextMD is recommended as a technical metadata format for textual documents.
MIX is an XML-based format and the most popular implementation of the NISO Z39.87 standard. It can also be easily integrated with METS. It is therefore recommended for storing technical metadata about still images.
PREMIS is in fact the only format used in practice to store preservation metadata. 41% of existing recommendations and 22% of survey respondents have indicated it. As it is XML-based, PREMIS can also be easily integrated with the METS format. It is actively developed (the Editorial Board is currently working towards version 3.0) and has its own PREMIS ontology for exposing information through semantic technologies.
TextMD is not widely used by the institutions from the survey, and it is rarely indicated by existing recommendations. In fact, no indications are given for technical metadata of textual documents at all. It therefore seems reasonable to use a format which is already well integrated with the structural metadata and preservation metadata recommendations. TextMD is such a format: it is XML-based and can be easily used within METS as well as within PREMIS. It is also supported by characterization tools (e.g. JHOVE).
OCR results format
Recommended: ALTO, PAGE
Alternative: UTF-8 encoded plain text
The ALTO format has been indicated by 29% of existing recommendations. It was developed to extend METS, so that information about coordinates (ALTO) can be provided together with structural information (METS). The benefits and disadvantages of ALTO are discussed in the section of this document dedicated to ALTO. The main advantages include interoperability, readability (XML-based) and simplicity. The main disadvantages are the limited number of supported region types and the lack of support for capturing logical structure (this needs to be done by a container format like METS). ALTO export is also supported by some of the commercial OCR engines, and the format has been selected by ongoing initiatives (e.g. the Europeana Newspapers project).
One of the main design goals of the PAGE format was to enable detailed and accurate description of any information which can be derived from a given document image, by overcoming limitations of existing formats (like ALTO) and allowing its use in applications requiring a very precise content representation (such as performance evaluation). The PAGE format does not have a wide range of users, but it is gaining more and more attention, as it is used in initiatives and projects such as the IMPACT Centre of Competence, eMOP, Europeana Newspapers and Transcriptorium.
The alternative format is a simple text file encoded with UTF-8. It is only an alternative because it lacks support for structural information: the file is simply a stream of characters. We recommend using the UTF-8 encoding as it is compatible with ASCII and is able to encode various diacritics. It is also worthwhile to store OCR results in such text files in a Unicode normalization form (such as NFC).
In the case of historical documents, especially those with special characters not currently available in the Unicode standard, we recommend using the MUFI specification (code points) when training the OCR engine, which results in MUFI characters in the OCR output. Such an approach minimizes the risk of code point collisions between textual resources coming from different digitization projects or software tools. MUFI characters are also likely to be incorporated into Unicode itself (e.g. 152 MUFI characters were added in Unicode 5.1). For details on MUFI please see the section of this document dedicated to MUFI.
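The coordinate information carried by ALTO can be illustrated with a short sketch that reads words and their positions and rebuilds plain text line by line. It relies on the ALTO elements `TextLine` and `String` and the attributes `CONTENT`, `HPOS`, `VPOS`, `WIDTH` and `HEIGHT`; it assumes all coordinate attributes are present, ignores hyphenation handling, and matches elements by local name because the namespace URI differs between ALTO versions. The function names and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the namespace part of a tag name."""
    return tag.rsplit("}", 1)[-1]


def alto_words(alto_xml):
    """Yield (text, hpos, vpos, width, height) for each String element."""
    root = ET.fromstring(alto_xml)
    for elem in root.iter():
        if local_name(elem.tag) == "String":
            a = elem.attrib
            yield (a.get("CONTENT", ""), float(a["HPOS"]), float(a["VPOS"]),
                   float(a["WIDTH"]), float(a["HEIGHT"]))


def alto_plain_text(alto_xml):
    """Rebuild plain text with one output line per TextLine element."""
    root = ET.fromstring(alto_xml)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) == "TextLine":
            words = [child.attrib.get("CONTENT", "") for child in elem
                     if local_name(child.tag) == "String"]
            lines.append(" ".join(words))
    return "\n".join(lines)


# Hypothetical, heavily shortened ALTO document with a single text line.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="Hello" HPOS="10" VPOS="20" WIDTH="50" HEIGHT="12"/>
      <SP/>
      <String CONTENT="world" HPOS="65" VPOS="20" WIDTH="52" HEIGHT="12"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

print(alto_plain_text(SAMPLE_ALTO))  # Hello world
```

The same traversal is the starting point for deriving the UTF-8 plain text alternative from ALTO, at the cost of losing the coordinates.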
Delivery file format
Recommended: JPEG, PDF, JPEG2000 (JP2), ePUB, MOBI derived from ePUB
Delivery files are intended for the end user, so they should be easy to use and simple to display. It is also worthwhile to consider offering several delivery formats for a given digital object, as different users can have different preferences.
JPEG format has been indicated by most of existing recommendations (82%) and majority of Succeed survey respondents (71%). It is a general purpose image format which uses lossy compression to minimize the size of an image. JPEG is supported by practically all web browsers, including mobile ones.
The PDF format is the most popular among Succeed survey respondents (77%) and is also very popular in existing recommendations (53%). PDF requires special software to be displayed on a computer, but some web browsers have lately added built-in support for PDF (e.g. Firefox and Chrome), so in some cases this is no longer a barrier. PDF also supports progressive download. In addition it supports multiple layers, so it can be used for images, textual content or both.
JPEG2000 is also an option to consider for online delivery, especially when one wants to provide high-resolution images. JPEG2000 supports tiles and multiple resolution levels, which makes it well suited to such applications. It requires dedicated software to be displayed in a user's web browser, but there are already specifications and tools supporting such features (e.g. IIIF, OpenSeadragon). Thanks to such solutions it is possible to use production master files as a direct source for online delivery of digital content.
E-book readers require a textual format. The most popular formats in this context are ePUB and MOBI, so those two formats are recommended in such cases. ePUB and MOBI files can be generated directly from ALTO, PAGE or UTF-8 encoded plain text. In the case of MOBI it is important to note that it is a proprietary format. The reason for recommending it is that it is the format supported by Kindle® devices, which are currently very popular among e-book readers. The best approach is to keep ePUB as the primary delivery format and convert it to MOBI (free tools are available) in order to support a wide range of users and their devices.
OCR results can be provided either together with the presentation format (e.g. embedded in a PDF) or as a separate file in the format used for preservation of OCR results.
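The way tiles and resolution levels are requested can be illustrated with the URL pattern of the IIIF Image API, in which region, size, rotation, quality and format are path segments after the image identifier. The sketch below only composes such a URL; the server address and identifier in the usage note are hypothetical, and the default size keyword `max` assumes a recent version of the API (older versions use `full`).

```python
from urllib.parse import quote


def iiif_image_url(base_url, identifier, region="full", size="max",
                   rotation="0", quality="default", fmt="jpg"):
    """Compose an IIIF Image API request URL from its path segments."""
    return "/".join([
        base_url.rstrip("/"),
        quote(identifier, safe=""),  # identifiers are percent-encoded, including '/'
        region,                      # 'full' or 'x,y,w,h' in pixels
        size,                        # e.g. 'max' or 'w,' to scale to a given width
        rotation,                    # degrees clockwise
        f"{quality}.{fmt}",
    ])
```

For example, `iiif_image_url("https://iiif.example.org/images", "book1/page 3", region="0,0,1024,1024", size="512,")` requests the top-left 1024 by 1024 pixel region scaled to a width of 512 pixels. A viewer such as OpenSeadragon issues many requests of this kind as the user zooms and pans.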
Descriptive metadata format for online delivery
Recommended: DCMES (Dublin Core), EDM
Dublin Core Metadata Element Set (DCMES) is a must for each institution that wants to provide descriptive metadata online. It is a basic set of 15 elements providing general information about the resource. Dublin Core is the metadata format most often provided online by Succeed survey respondents (69%). It is the only format that must be supported when implementing OAI-PMH communication (OAI-PMH is a widely accepted metadata harvesting protocol, used by Europeana and the Digital Public Library of America). Although it is simple and very popular, its main disadvantage is the lack of a precise interpretation of each element. This may cause inconsistency, e.g. at the level of a metadata aggregator.
The Europeana Data Model (EDM) has been introduced to enable delivery of richer information to the Europeana portal than is possible with Dublin Core or Europeana Semantic Elements. EDM was prepared to support all the important requirements of cultural heritage institutions. The idea was to increase interoperability of metadata, leverage semantic technologies, and provide finer granularity and more semantics. EDM is based on existing formats and standards, such as Dublin Core, SKOS and OAI-ORE. It is already used by 22% of survey respondents. European institutions are strongly recommended to use EDM for exposing metadata about their content, as EDM makes full integration with Europeana possible.
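A harvester asks an OAI-PMH repository for Dublin Core records by sending a `ListRecords` request with the metadata prefix `oai_dc`. The sketch below only builds the request URL; the repository address in the usage note is hypothetical and the function name `list_records_url` is ours.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None, from_date=None):
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec    # optional selective harvesting by set
    if from_date:
        params["from"] = from_date  # optional lower bound, e.g. '2014-01-01'
    return f"{base_url}?{urlencode(params)}"
```

Calling `list_records_url("https://repository.example.org/oai")` gives `https://repository.example.org/oai?verb=ListRecords&metadataPrefix=oai_dc`. A repository that additionally offers a richer format exposes it under another metadata prefix at the same address.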
Identification of objects
Recommended: OAI Identifier, DOI
In the context of identifiers there are two main options: OAI Identifier and DOI.
The OAI Identifier is a free solution which makes it possible to implement persistent identifiers in repositories that support the OAI-PMH protocol. It does not build on a common infrastructure: it relies on the digital repository which implements the OAI-PMH protocol and provides OAI Identifiers in OAI-PMH communication. It is also based on domain names, which means that one part of the OAI Identifier contains the domain name of the service providing the OAI-PMH functionality. As a consequence it may introduce some confusion, e.g. when the domain name is changed.
DOI is a paid service for maintaining persistent identifiers of digital content. DOI is based on the Handle System and is used by many institutions (15% of respondents). DOI has been selected over the Handle System (which is also used by 15% of respondents) because it adds further features, including persistence, consistency and a robust technical infrastructure. The benefit of such an approach is a reliable, existing infrastructure (provided by the Handle System and DOI) as well as independence from a specific technology (as opposed to the OAI Identifier, which is based on domain names).
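The dependence on domain names is visible in the identifier syntax itself, which has the form `oai:` followed by the repository domain name, a colon and a local identifier. The sketch below splits an identifier into these parts; the function name `parse_oai_identifier` and the sample identifier are hypothetical, and the check is intentionally looser than the full syntax rules.

```python
def parse_oai_identifier(identifier):
    """Split an OAI Identifier into (domain, local_id); raise ValueError if malformed."""
    # Split on the first two colons only: the local part may itself contain colons.
    parts = identifier.split(":", 2)
    if len(parts) != 3 or parts[0] != "oai" or not parts[1] or not parts[2]:
        raise ValueError(f"not an OAI Identifier: {identifier!r}")
    return parts[1], parts[2]


print(parse_oai_identifier("oai:repository.example.org:books/123"))
# ('repository.example.org', 'books/123')
```

If the repository later moves to another domain, identifiers that have already been published still carry the old domain name.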
Advanced and supporting technologies
Advanced and supporting technologies in digitization-related activities have the potential to improve the interoperability, processing time and quality of the whole digitization workflow. We have investigated three aspects: semantic technologies, linguistic resources and tools packaging.
Linked Open Data
Recommended: RDFa, SPARQL
The Linked Open Data (LOD) paradigm introduces a new way of thinking about resources available on the web. The main idea behind LOD is to have resources interlinked with other resources, so that it is easy to discover new resources and find relations between them. The term open indicates that the data should be available under open licenses, such as the Creative Commons 0 Public Domain Dedication (which is used by Europeana). There are multiple standards related to semantic technologies which can be used when publishing resources on the web. Those maintained by the W3C include RDF, OWL, SPARQL, RDFa, SKOS and RDFS.
We recommend investigating two standards when considering Linked Open Data: RDFa for representing RDF triples on a website and SPARQL for querying information available in an RDF store. Both are maintained by the W3C. There are other options which can be considered as well; nevertheless those two standards seem to be the most appropriate for general purposes.
RDFa is a standard which makes it possible to embed RDF triples into HTML, XHTML or XML documents. RDFa offers an easy way of exposing resources and information in the form of Linked Data. The features of RDFa can be used in a limited way, which makes implementation very simple (RDFa Lite), or fully, which requires more expertise (RDFa Core). As a result, semantic information can be extracted from a website (e.g. from a digital library) by automated tools and then further processed. RDFa is already used by 21% of survey respondents, which is 62% of those who use semantic technologies.
In order to enable more advanced access to resources, it is recommended to build a SPARQL interface for preserved data. SPARQL is a query language for RDF and a common way of accessing information stored in RDF (it is used by 8% of survey respondents, which is 23% of those who use semantic technologies). In order to provide a SPARQL interface (endpoint) it is necessary to have an RDF datastore, which is a kind of database for RDF triples (also called a triplestore). Such a datastore can be built, for example, from information available on the web in RDFa.
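The idea of automated extraction can be illustrated with a deliberately simplified sketch that collects the values of RDFa Lite `property` attributes from an HTML fragment. It is not a conformant RDFa processor: it ignores `vocab`, `prefix`, `typeof` and `resource`, does not handle nested markup inside a property element, and does not produce RDF triples. The class name, the function name and the sample fragment are hypothetical.

```python
from html.parser import HTMLParser


class PropertyCollector(HTMLParser):
    """Collect (property, value) pairs from RDFa Lite `property` attributes."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._open = None   # property whose text content is being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("property")
        if prop is None:
            return
        # A value may be given directly in a `content` or `href` attribute ...
        literal = attrs.get("content") or attrs.get("href")
        if literal is not None:
            self.pairs.append((prop, literal))
        else:
            # ... otherwise it is the text content of the element.
            self._open = prop
            self._buffer = []

    def handle_data(self, data):
        if self._open is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if self._open is not None:
            self.pairs.append((self._open, "".join(self._buffer).strip()))
            self._open = None


def extract_properties(html):
    """Return the list of (property, value) pairs found in an HTML fragment."""
    collector = PropertyCollector()
    collector.feed(html)
    return collector.pairs


# Hypothetical fragment of a digital library page marked up with RDFa Lite.
SAMPLE_HTML = """
<div vocab="http://purl.org/dc/terms/" typeof="BibliographicResource">
  <h1 property="title">Atlas of Poland</h1>
  <span property="creator">Jan Kowalski</span>
  <meta property="issued" content="1910">
</div>
"""

print(extract_properties(SAMPLE_HTML))
# [('title', 'Atlas of Poland'), ('creator', 'Jan Kowalski'), ('issued', '1910')]
```

A production system would use a full RDFa processor, which resolves the property names against the declared vocabulary and outputs proper RDF triples.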
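A SPARQL endpoint is typically queried over HTTP, with the query text passed in the `query` parameter of the request. The sketch below builds such a GET request URL for a simple query listing resources and their Dublin Core titles; the endpoint address in the usage note is hypothetical and the query assumes that the datastore contains `dc:title` statements.

```python
from urllib.parse import urlencode

# List up to ten resources together with their Dublin Core titles.
QUERY = """
PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT ?item ?title
WHERE { ?item dc:title ?title }
LIMIT 10
"""


def sparql_get_url(endpoint, query):
    """Build a GET request URL carrying the query in the `query` parameter."""
    return f"{endpoint}?{urlencode({'query': query.strip()})}"
```

The URL returned by `sparql_get_url("https://data.example.org/sparql", QUERY)` can be fetched with any HTTP client; the result format (for example JSON or XML) is usually negotiated through the `Accept` header.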
Linguistic resources
Recommended: TEI, CMDI or LMF
For discovery, retrieval and reuse of linguistic data it is important that the data is stored in a predictable format. There are many elements that can be preserved in the context of linguistic resources; we focus here on corpora and dictionaries, which can be helpful when improving OCR techniques in the digitization workflow.
The TEI format is primarily semantic rather than presentational; the semantics and interpretation of every tag and attribute are specified. It defines some 500 different textual components and concepts (word, sentence, character, glyph, person, etc.); each is grounded in one or more academic disciplines and examples are given. TEI Lite is an XML-based file format for exchanging texts. It is a manageable selection from the extensive set of elements available in the full TEI Guidelines. TEI offers tools like ODD and ROMA, which assist a user in choosing a subset from the TEI repertoire. For linguistic resources a special customization is available, called TEI Corpus. TEI is also already present in the cultural heritage sector. It is therefore worthwhile to consider its use for linguistic resources as well, especially for those who already use TEI for other purposes.
The Component Metadata Infrastructure (CMDI) is developed within the CLARIN project. It provides a framework to describe and reuse metadata blueprints. Description building blocks (“components”, which include field definitions) can be grouped into a ready-made description format (a “profile”). Both are stored and shared with other users in the Component Registry to promote reuse. Each metadata record is then expressed as an XML file, including a link to the profile on which it is based. The metadata is stored in repositories which are harvested. CLARIN provides a central portal for discovery of resources (the CLARIN Virtual Language Observatory). Moreover, CLARIN makes special software available for editing CMDI records (Arbil). CLARIN aims to provide an infrastructure for research within Europe, including libraries and public archives. This infrastructure will not be available to parties outside that domain, such as commercial enterprises and individuals.
The Lexical Markup Framework (LMF) is an ISO standard (ISO 24613:2008). The goals of LMF are to provide a common model for the creation and use of lexical resources, to manage the exchange of data between and among these resources, and to enable the merging of a large number of individual electronic resources to form extensive global electronic resources. Individual instantiations of LMF can include monolingual, bilingual or multilingual lexical resources. The same specifications are to be used for both small and large lexicons, for both simple and complex lexicons, and for both written and spoken lexical representations. Linguistic constants like /feminine/ or /transitive/ are not defined within LMF but are recorded in the Data Category Registry (DCR), which is maintained as a global resource by ISO/TC37 in compliance with ISO/IEC 11179-3:2003. These constants are used to adorn the high-level structural elements. LMF is relatively new, but has already gained considerable popularity. According to some linguists the standard is not strict enough. ISO addressed that issue by creating reference structures for several subdomains.
Tools packaging
Recommended: package tools for targeted operating systems (at least for MS Windows and Linux)
Tools packaging is one of the elements that make the maintenance and uptake of new tools easier. The benefit of having specific packages for particular operating systems is that the installation process can be automated. For example, on Linux systems packaging provides the means to install or update software packages, including shortcuts and command line tools. It can also automatically add documentation (e.g. to manpages). This would not be possible without a software package (software could simply be run from binaries, but then with no deep integration with the operating system). It is therefore highly recommended to use tools packaging techniques in order to deliver software to end users. The survey analysis indicates that MS Windows is the most popular operating system (87% of respondents). Linux is second (49%). Unix and MacOS have approx. 10% popularity each. This indicates that when building software packages at least MS Windows and Linux should be supported, so that most potential users can use an automated installation procedure.