Frequently Asked Questions
Where Optical Character Recognition is to be applied to text-based material, the minimum recommended standard is 8-bit greyscale at 300 pixels per inch. Where text is very small a resolution of 400-600ppi is recommended.
If the images are being created to preserve significant detail above and beyond Optical Character Recognition, the following formats and standards are recommended:
For sources containing text with black-and-white illustrations (pictures or shades of grey): TIFF or JPEG2000, resolution of 300 ppi, bit depth of 8 bits, with no compression or lossless compression.
For sources containing text with or without line drawings: TIFF or JPEG2000, resolution of 600 ppi, bit depth of 1 bit, with no compression. If compression is necessary for storage purposes, lossless is recommended.
For more information on standards, see the Image Capture chapter in the IMPACT Decision support tools: <available shortly>
For more information on digital storage, see the IMPACT Storage Estimator (ISE).
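As a rough illustration of how the capture standards above affect storage, the sketch below estimates the uncompressed size of a single page image. It is not the IMPACT Storage Estimator: the function name and the 8 x 10 inch page size are assumptions made for the example, and file headers, metadata and compression are ignored.

```python
def uncompressed_size_bytes(width_in, height_in, ppi, bit_depth):
    """Approximate size of an uncompressed raster image in bytes.

    Ignores file headers, metadata and any row padding added by the format.
    """
    width_px = round(width_in * ppi)
    height_px = round(height_in * ppi)
    total_bits = width_px * height_px * bit_depth
    return (total_bits + 7) // 8  # round up to whole bytes


# Hypothetical 8 x 10 inch page captured to the two standards above.
greyscale = uncompressed_size_bytes(8, 10, ppi=300, bit_depth=8)  # 7,200,000 bytes
bitonal = uncompressed_size_bytes(8, 10, ppi=600, bit_depth=1)    # 3,600,000 bytes
print(greyscale, bitonal)
```

Under these assumptions the 600 ppi bitonal image is half the size of the 300 ppi greyscale image, because the drop from 8 bits to 1 bit per pixel outweighs the fourfold increase in pixel count.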
In general, the best OCR results will be produced by creating images at a resolution of 300 pixels per inch (ppi) or above, in colour or grey-scale. These standards will preserve the vast majority of the detail in the original item, whereas lower resolutions will result in progressively worse OCR recognition.
Digital image files can generally be divided into two groups: those created for preservation and those created for access and use. In institutions with a developed digitisation workflow, it is usual practice to create the preservation master image first and derive any access images from it. This sometimes means that the access image will be of deliberately lower quality (and smaller in terms of bytes) than the preservation master, in order to speed up transfer across a network.
There are four main types of access image:
- Thumbnail/monitor images for web and multimedia presentation – normally a compressed RGB JPG, in an sRGB colour space, scaled down to the appropriate size.
- Commercial print images – uncompressed TIFF files within an RGB colour space.
- In-house print images – uncompressed TIFF files within an RGB colour space, scaled to the appropriate size. This may be identical to the Optimised Master Image.
- Master monitor image – usually a resized optimised master, an uncompressed (or losslessly compressed) TIFF or PNG file in Adobe RGB 1998 or sRGB colour space.
If images are being created purely for OCR, with no other usages envisaged, then it is recommended that images be created at a resolution of 300 pixels per inch (ppi) or above, in colour or grey scale.
Scanning from microfilm is a developing field and more research is needed. However, it is generally done when a very large amount of material, of more or less uniform size and type, needs to be digitised for OCR. Newspapers are the most common content type to be treated this way. Creating microfilm for OCR means that the actual character recognition process takes less time and manual intervention, which can result in a cost saving overall.
For more information on this topic, see the pilot Case Study – Scanning From Microfilm: A Case Study.
Large-scale digitisation projects may require input from external suppliers at any stage. For instance, a digitising institution may lack the physical capacity to undertake the necessary work in-house, or there may be a lack of knowledge within the institution about how to cost-effectively digitise very complex material (e.g. newspapers). In such cases, outsourcing tasks to a third party can be greatly beneficial in terms of scoping, managing and costing a digitisation project from beginning to end.
While outsourcing may deliver a more streamlined and cost-effective workflow, such collaboration can also bring potential complications: for instance, the necessity of defining the legal ownership of the resulting work, the possible use of proprietary software in creating the image files and metadata (which may limit the reuse potential of the work), restrictions on access for commercial reasons, etc.
Outsourcing a part of the workflow also means that the digitising institution is less likely to retain working knowledge of that part of the workflow for its next project.
In-house digitisation will tend to give an institution more flexibility in terms of fine-tuning workflows and infrastructure to hit their targets. Processing in-house will also generally have the effect of building capacity for digitisation in these institutions, gradually removing the dependence on commercial suppliers and their pricing standards.
In brief, then, the decision to outsource should be based first on a thorough assessment of the material to be digitised, and then on the institution’s own capacity to do the work.
Optical Character Recognition (OCR) is the electronic translation of text-based images into machine-readable and editable text. Usually performed by software devices as part of a digitisation workflow, OCR works by performing a layout analysis of a digital image and breaking that image into smaller structural components to find zones of textual content. These zones will include the overall area of the page that features text. Within that zone the OCR software will identify individual lines of text, and within those lines will identify individual words and characters.
Many OCR software suites are available for many types of use, and each runs to slightly different standards and methodology. At its simplest, however, all OCR software follows the same basic principle: once the software engine has identified a single character, it runs that character’s properties through an internal classification of text fonts to find a match. It repeats the process for all characters within a word, and then runs that information through a dictionary of complete words to find a match. It extends this process through sentences, lines and text blocks until – ideally – all text in the image has been identified.
Any digitisation project investigating OCR should have three main considerations:
- Suitability of material for OCR – some types of material are better suited to OCR than others. For instance, handwritten or manuscript material will rarely give the uniformity of character needed for OCR software to be effective; badly damaged paper can lead to machine-unreadable text, etc. But if a project decides to OCR, the second consideration comes into play: namely, the level of accuracy needed in order to deliver a satisfying experience to end users.
- Accuracy threshold needed – OCR software packages promise a certain level of accuracy within defined parameters. They will tend to define accuracy as the percentage of characters recognised correctly out of the total number of characters converted. Moreover, the accuracy will tend to have been measured on an ‘ideal’ document and will therefore not give a true indication of how the software will perform on historical material. Many institutions engaged in mass digitisation of historic text materials using OCR will set an acceptable threshold of OCR accuracy in advance of scanning, and visually check the OCR accuracy against that target on random batches of material. Most institutions will factor this checking into the workflow of their digitisation projects and programmes (i.e. it is an ongoing task through the digitisation lifecycle). For a one-off project, however, it is usually less costly and time-consuming to digitise a smaller selection of relevant material first and test the practical OCR accuracy on this sample. This has the distinct advantage of highlighting the potential problems presented by the material and/or image capture process, and allows an institution to change its workflow or capture standards to take these problems into account before ramping up to full production.
- Potential further use of OCR results – if OCR results are being used primarily for indexing and retrieving online or through a catalogue (i.e. where the user will never see the actual OCR results), the absolute accuracy of the results is sometimes less significant. Search engines such as Google employ fuzzy searching, where a misspelled or misrecognised word will be matched against the actual word it most resembles, so a strict 90% character accuracy level can nevertheless result in a 98% retrieval rate. Using OCR as an indexing tool is increasingly common practice in large-scale and mass digitisation.
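The fuzzy searching mentioned in the last point can be sketched with Python's standard `difflib` module. This is only an illustration of the general idea, not a description of how Google or any other search engine works; the function name, the sample index and the 0.8 similarity cutoff are assumptions chosen for the example.

```python
import difflib


def fuzzy_lookup(query, indexed_terms, cutoff=0.8):
    """Return indexed terms that closely resemble the query, best match first."""
    return difflib.get_close_matches(query, indexed_terms, n=5, cutoff=cutoff)


# Hypothetical index built from OCR output containing misrecognised words.
ocr_index = ["digitisation", "newspapcr", "librarv"]

# A correctly spelled query still finds the misrecognised index entry.
print(fuzzy_lookup("newspaper", ocr_index))  # ['newspapcr']
```

A lower cutoff retrieves more misrecognised words but also more false matches, so the value would need tuning against real material.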
In practice, OCR can be carried out at any time once digital images have been created. With the rise of automated book scanners and their accompanying software suites, it has become common to run OCR immediately after the image has been created and saved. This is certainly the most sensible approach when designing a large-scale digitisation workflow, particularly as OCR tends to produce its best results from optimised master images. However, any text-based digital image can be run through an OCR engine at any time, even legacy images not created for the purpose.
IMPACT recommends that before the decision to create OCR has been taken for a project, a small representative sample of every type of material should be run through an OCR engine to see if the results are good enough to justify making OCR a central part of the workflow.
The most effective and comprehensive way of measuring the accuracy of OCR results remains manual revision, but this is cost prohibitive at any level, and unworkable as a single method of evaluation in a mass digitisation context.
One of the simplest alternatives has been to manually check an OCR engine’s software log for a random batch of files. The software log is where the OCR engine documents its own (self-perceived) success rate, ranking character and word matches against the software’s internal dictionary. Checking the log therefore enables the user only to assess how successful the software thinks it’s been, and not how successful it’s actually been.
A similar but more thorough method is a human-eye comparison between the text in the digital image file and a full-text representation of the OCR output from that text. This is obviously more time-consuming than checking against the software log, but gives a much more accurate idea of the success rate of the OCR. Key to this method is ensuring that the sample images checked are representative of the entire corpus to be digitised, in type and quality; otherwise the results for the particular sample can be far from the overall average.
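Where a manually corrected transcription of a sample page is available, the comparison with the OCR output can also be scored automatically. The sketch below is one possible approach rather than a method prescribed by IMPACT: it assumes character accuracy is defined as one minus the edit (Levenshtein) distance divided by the length of the reference text, and both function names are illustrative.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def character_accuracy(reference, ocr_output):
    """Share of reference characters recovered correctly, between 0.0 and 1.0."""
    if not reference:
        raise ValueError("reference text must not be empty")
    distance = levenshtein(reference, ocr_output)
    return max(0.0, 1 - distance / len(reference))


# One wrong character out of four gives 75% character accuracy.
print(character_accuracy("abcd", "abxd"))  # 0.75
```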
All of these methods can be used at a project’s start-up phase – as a benchmarking exercise for both hardware and software – or throughout a project’s life cycle. But because they are necessarily limited in scope due to any project’s timescale and resourcing, they tend to wane in importance as a project progresses. A simple, statistical method for monitoring OCR success throughout a project is to include the software log’s success rate in the OCR output file, or at least keep it separately. Looked at en masse, it will give an overview of where the OCR engine thinks it’s succeeding and where it thinks it’s failing. If there are wide discrepancies between one batch of files and another, the software log will allow the institution to prioritise those files where OCR accuracy is low, and to manage (and hopefully mitigate) those discrepancies.
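The batch-level monitoring described above might look something like the following sketch. It assumes that each file's self-reported success rate has already been extracted from the software log as a number between 0 and 1; the batch names, the scores and the 0.90 threshold are all hypothetical.

```python
from statistics import mean


def flag_low_confidence_batches(batch_scores, threshold=0.90):
    """Average the engine's self-reported scores per batch and flag weak batches.

    Returns (averages, flagged), where flagged lists the batches below the
    threshold, lowest average first.
    """
    averages = {batch: mean(scores) for batch, scores in batch_scores.items()}
    flagged = sorted(
        (batch for batch, average in averages.items() if average < threshold),
        key=averages.get,
    )
    return averages, flagged


# Hypothetical per-file scores taken from the OCR engine's log.
log_scores = {
    "batch_01": [0.97, 0.95],
    "batch_02": [0.80, 0.84],
    "batch_03": [0.88, 0.90],
}
averages, flagged = flag_low_confidence_batches(log_scores)
print(flagged)  # ['batch_02', 'batch_03']
```

Because these figures reflect only how successful the engine thinks it has been, flagged batches are candidates for visual checking rather than confirmed failures.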
In general, yes. IMPACT recommends that files created for OCR are captured in at least 8-bit colour-depth, at least 300 pixels per inch (ppi), using lossless compression. Modern documents may be recognised efficiently at lower standards, but with historical documents there is a large risk of obscuring significant detail or introducing digital noise to the image.
OCR can be improved by the following methods, not all of which will be practicable within a single digitisation workflow:
- Select material on the primary basis that it is suitable for OCR
- Scan at resolution of 300 ppi or above
- Use TIFF or JPEG2000 files with lossless or no compression for OCR
- Test image optimisation software for best results on certain content
- Using image optimisation software, automatically de-skew every page on a vertical and horizontal grid so that image text is as horizontal as possible
- Use voting technology (i.e. use more than one OCR software solution and voting technology picks the best results), while being aware that this will add cost and processing time
- Use an appropriate or dedicated dictionary for the OCR engine to see if a more accurate vocabulary results in greater accuracy
- Use language modelling software within the OCR engine to improve intelligent recognition of characters and words (e.g. train a machine to understand that in English a ‘q’ is almost always followed by a ‘u’)
- Clean some or all of the OCR text manually
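The voting approach in the list above can be illustrated with a simple word-level majority vote. This is a minimal sketch and not the algorithm used by any particular product: it assumes the outputs of the different engines are already aligned word for word, which real voting software cannot take for granted, and the function name and sample strings are invented for the example.

```python
from collections import Counter


def vote_words(engine_outputs):
    """Pick the most common candidate for each word position across engines.

    Assumes every output contains the same number of whitespace-separated words.
    On a tie, the candidate from the earliest engine in the list wins.
    """
    token_lists = [text.split() for text in engine_outputs]
    if len({len(tokens) for tokens in token_lists}) != 1:
        raise ValueError("outputs must contain the same number of words")
    voted = []
    for candidates in zip(*token_lists):
        winner, _count = Counter(candidates).most_common(1)[0]
        voted.append(winner)
    return " ".join(voted)


# Hypothetical outputs from three OCR engines, each with a different error.
outputs = [
    "The qnick brown fox",
    "The quick brovvn fox",
    "The quick brown f0x",
]
print(vote_words(outputs))  # The quick brown fox
```

Each engine makes one mistake here, but no two engines make the same mistake, so the majority is correct at every position. When engines share the same weakness, voting will not help.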
In a page with 1,000 characters and 100 words, an OCR engine may register a character accuracy level of 99%. This means that ten characters on the page have been recognised incorrectly. However, these ten characters may be distributed between ten different words, which would mean that word accuracy is only 90%. In practice it is only possible to assess the level of word accuracy by visual inspection.
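The arithmetic in this example can be written out as a short sketch. The function name is illustrative, and the best case rests on an added assumption that errors cluster in words of average length. The worst case, where every misrecognised character falls in a different word, is the one described above.

```python
import math


def word_accuracy_bounds(total_chars, total_words, char_accuracy):
    """Return (worst_case, best_case) word accuracy for a given character accuracy."""
    char_errors = round(total_chars * (1 - char_accuracy))
    # Worst case: each wrong character spoils a different word.
    worst_bad_words = min(char_errors, total_words)
    # Best case (assumption): errors cluster in as few average-length words as possible.
    average_word_length = total_chars / total_words
    best_bad_words = math.ceil(char_errors / average_word_length)
    worst = 1 - worst_bad_words / total_words
    best = 1 - best_bad_words / total_words
    return worst, best


worst, best = word_accuracy_bounds(total_chars=1000, total_words=100, char_accuracy=0.99)
print(f"word accuracy between {worst:.0%} and {best:.0%}")  # between 90% and 99%
```

The same character accuracy can therefore correspond to quite different word accuracies, which is why the distribution of errors matters as much as their number.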
Text transcription should be preferred when the material is too old, damaged, complex or otherwise eccentric to be converted into accurate OCR. It should also be preferred when the focus of the project is on 100% accuracy of the text.
Most high-end OCR engines will allow you to select from a list of languages and run OCR in all of them. This means the languages present must be known in advance, either through visual inspection of the material before OCR or from a catalogue in which the language details are recorded.