Working with the Virtual Transcription Laboratory
Figure 1 presents a general overview of the Virtual Transcription Laboratory (VTL) workflow. After registering a new account, users can create projects based on scanned images from their own collections or from an existing digital library. Once the content is in place, users can create a base version of the transcription and start collaborating on its correction and enrichment. The results of the work can be exported in a variety of formats.
After registering in VTL, users can log in and create new projects. The project is the basic unit that organizes work in VTL. A project contains images, some basic metadata and a transcription. Creating a project starts with entering information about the transcribed document: the name of the project, the title and creator of the original document, keywords describing both the project and the document, the type of content (printed, manuscript or other) and, finally, the language(s) of the text. This metadata can be keyed in manually or imported from an existing description of the object in a digital library.
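The project description listed above can be pictured as a small record. The following Python sketch is only an illustration: the class and field names are assumptions chosen to mirror the listed metadata, not VTL's actual data model.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class ContentType(Enum):
    # The three content types named in the text.
    PRINTED = "printed"
    MANUSCRIPT = "manuscript"
    OTHER = "other"


@dataclass
class ProjectMetadata:
    # Hypothetical field names mirroring the metadata listed in the text.
    name: str
    title: str
    creator: str
    keywords: List[str] = field(default_factory=list)
    content_type: ContentType = ContentType.PRINTED
    languages: List[str] = field(default_factory=list)


def missing_fields(meta: ProjectMetadata) -> List[str]:
    """Return the names of descriptive fields that are still empty."""
    required = {
        "name": meta.name,
        "title": meta.title,
        "creator": meta.creator,
        "languages": meta.languages,
    }
    return [key for key, value in required.items() if not value]
```

A form handler could call `missing_fields` before creating the project, whether the values were typed in or imported from a digital library record.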
Import of Scanned Images
The next step is to add scanned images. VTL supports uploading files from the hard disk of the user’s computer or importing certain kinds of documents directly from a digital library (only some digital libraries are supported at the moment).
VTL supports the upload of images in the most widely used graphical formats, including PNG, JPG and TIFF, as well as DjVu and PDF files. Files in TIFF, PDF and DjVu formats are automatically converted into PNGs. Apart from uploading single files, users can also upload a ZIP archive containing images. VTL will automatically extract the archive and add the uploaded files to the project, converting them if necessary.
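The handling of an uploaded ZIP archive can be sketched as follows. This is a minimal illustration that assumes files are told apart by their extension; the function name `plan_import` and the extension sets are hypothetical, and the actual conversion to PNG is left out.

```python
import io
import zipfile
from pathlib import PurePosixPath
from typing import Dict, List

# Formats named in the text: the first group is converted, the second is kept.
CONVERT_TO_PNG = {".tif", ".tiff", ".pdf", ".djvu"}
KEEP_AS_IS = {".png", ".jpg", ".jpeg"}


def plan_import(archive_bytes: bytes) -> Dict[str, List[str]]:
    """Sort archive members into files to add, to convert, and to skip."""
    plan: Dict[str, List[str]] = {"add": [], "convert": [], "skip": []}
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directories carry no image data
            suffix = PurePosixPath(info.filename).suffix.lower()
            if suffix in KEEP_AS_IS:
                plan["add"].append(info.filename)
            elif suffix in CONVERT_TO_PNG:
                plan["convert"].append(info.filename)
            else:
                plan["skip"].append(info.filename)
    return plan
```

Separating the planning step from the conversion makes it possible to show the user what will happen to each file before a long-running conversion job starts.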
The second way of adding files is a direct import from one of the supported digital libraries. This mechanism is based on OAI identifiers and is specific to dLibra-based (http://dlibra.psnc.pl) digital libraries.
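For readers unfamiliar with OAI identifiers, the sketch below shows how such an identifier is typically used to request a single record through the standard OAI-PMH `GetRecord` verb. The text does not say which requests VTL actually issues, so this is an assumption; the base URL and identifier in the usage note are placeholders.

```python
from urllib.parse import urlencode


def get_record_url(base_url: str, oai_identifier: str,
                   metadata_prefix: str = "oai_dc") -> str:
    """Build an OAI-PMH GetRecord request URL for one object."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": oai_identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"
```

For example, `get_record_url("https://example.org/oai", "oai:example.org:123")` yields a URL whose response would carry the Dublin Core description of the object, which could then pre-fill the project metadata.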
Operations such as importing files from a remote server or converting several dozen TIFF files are executed as non-blocking, asynchronous tasks. The user can close the current web browser window and work on something else. When an asynchronous task is completed, VTL notifies the user by sending an email message to the address associated with the user account. At any time users can also monitor the execution of a given task on a dedicated page in their profile.
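The pattern of a non-blocking task followed by a notification can be illustrated with Python's standard `concurrent.futures` module. This is a sketch under stated assumptions, not VTL's implementation: `convert_images` is a placeholder job, and `notify_user` only records a message where a real system would send an email.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import List, Tuple

# Stand-in for an outgoing mail queue (hypothetical).
notifications: List[Tuple[str, str]] = []


def convert_images(filenames: List[str]) -> List[str]:
    """Placeholder for a long-running job such as TIFF-to-PNG conversion."""
    return [name.rsplit(".", 1)[0] + ".png" for name in filenames]


def notify_user(email: str, future: Future) -> None:
    """Record a completion message; a real system would send an email here."""
    status = "failed" if future.exception() else "completed"
    notifications.append((email, status))


def submit_task(executor: ThreadPoolExecutor, email: str,
                filenames: List[str]) -> Future:
    """Start the job in the background and attach the notification callback."""
    future = executor.submit(convert_images, filenames)
    future.add_done_callback(lambda done: notify_user(email, done))
    return future
```

The caller gets the `Future` back immediately, so the web request can return while the conversion continues; the same object can back a task-status page.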
Base Version of Transcription
A transcription in VTL contains, apart from the text itself, the coordinates of the text in the original image and information about annotations and comments.
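One way to picture such a transcription is as a list of text lines, each carrying its bounding box and any attached comments. The classes below are an illustrative assumption; the names and the (left, top, right, bottom) pixel convention are not taken from VTL.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Annotation:
    author: str
    comment: str


@dataclass
class TextLine:
    text: str
    # Assumed convention: (left, top, right, bottom) in image pixels.
    bbox: Tuple[int, int, int, int]
    annotations: List[Annotation] = field(default_factory=list)


@dataclass
class PageTranscription:
    image_name: str
    lines: List[TextLine] = field(default_factory=list)

    def plain_text(self) -> str:
        """Join the line texts, dropping coordinates and annotations."""
        return "\n".join(line.text for line in self.lines)
```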
There are several ways to create a base version of the transcription in VTL. In most cases it will be created using VTL’s OCR service. Alternatively, the initial version can be created by manually keying in the text from the images or by importing an existing transcription, e.g. the recognition results of a standalone OCR engine (e.g. Tesseract or OCRopus).
Manual keying is a tedious task, but in some cases it might be the only way to create searchable text. When it is performed in VTL, the work can easily be distributed among a group of volunteers, which should significantly speed up the whole process.
The most convenient way to create the base transcription is to use VTL’s OCR service. Users can go through the project and invoke OCR processing on each scanned image manually, or run batch processing on all files in the project. Batch OCR is executed as an asynchronous task; the user is notified via email when the process is finished.
The web interface of the OCR service allows users to process the whole page or just a fragment of it. This might be useful when dealing with pages on which several languages are used next to each other. The user can mark the first half of the page, printed in one language, and run the OCR. When this first part of the processing is over, the user can go back to the OCR interface and process the rest of the page using a different language profile.
As already mentioned, the OCR service returns not only the text but also its coordinates. This might be useful when working with manuscripts. The current OCR service does not support connected scripts, but it marks the boundaries of text lines with good accuracy. In this usage scenario users can use only the coordinates of the lines (because the quality of the recognised text will probably be very poor). These coordinates work as a template and simplify the manual keying of the text.
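The manuscript scenario amounts to keeping the line boundaries and discarding the recognised text. A minimal sketch follows, assuming each OCR line is a dictionary with a `bbox` of (left, top, right, bottom) pixels and a `text` string; this representation and the function name are hypothetical.

```python
from typing import Dict, List


def to_keying_template(ocr_lines: List[Dict]) -> List[Dict]:
    """Keep each line's bounding box and blank out the unreliable text.

    Lines are ordered top to bottom, then left to right, so that the
    empty slots follow the reading order of the page.
    """
    ordered = sorted(ocr_lines, key=lambda line: (line["bbox"][1], line["bbox"][0]))
    return [{"bbox": line["bbox"], "text": ""} for line in ordered]
```

The result is a page of empty, correctly positioned lines that volunteers can fill in by hand.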
The OCR service described above can be customized for better recognition of certain types of documents. The customization process is implemented using the Cutouts (http://wlt.synat.pcss.pl/cutouts) application.
Correction of the Transcription
The work of a project editor consists of the following steps:
- go to the project page,
- select a page from the project or click “Go to the transcription editor”,
- review the text and verify that the recognised text corresponds to the content of the scan,
- when finished with a given page, go to the next page.
Every text line in the transcription has a status; after verification or correction it moves to the “checked” state. This transition can be made manually (the editor marks the line as checked) or automatically (if the editor spends more than 5 seconds working on a given line). Thanks to this mechanism the project owner can track the progress of the correction process. VTL can also select the pages with the highest number of unchecked lines and redirect project editors to them. The project owner can mark all lines on a given page as “checked” or “unchecked”. This feature should help to implement more sophisticated correction workflows.
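The status mechanism can be sketched in a few lines. The 5-second threshold comes from the text; everything else (the `LineState` class, the function names, and the tie-breaking by page order) is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

AUTO_CHECK_SECONDS = 5  # threshold stated in the text


@dataclass
class LineState:
    checked: bool = False


def record_edit(line: LineState, seconds_spent: float,
                marked_by_editor: bool = False) -> bool:
    """Mark a line as checked manually or after enough time spent on it."""
    if marked_by_editor or seconds_spent > AUTO_CHECK_SECONDS:
        line.checked = True
    return line.checked


def page_needing_most_work(pages: Dict[str, List[LineState]]) -> Optional[str]:
    """Return the page with the most unchecked lines, or None if all are done."""
    counts = {name: sum(not line.checked for line in lines)
              for name, lines in pages.items()}
    remaining = {name: count for name, count in counts.items() if count > 0}
    if not remaining:
        return None
    return max(remaining, key=remaining.get)
```

An editor could be redirected to the page returned by `page_needing_most_work`, and the owner's bulk action reduces to setting `checked` on every line of a page.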
VTL supports parallel modification of the transcription. If two users work on the same file, their changes are merged automatically. All modifications of the project are tracked, and the project owner can revert changes made by other users at any time.
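The text does not describe how VTL merges parallel changes. One plausible approach, shown here purely as an assumption, is a line-by-line three-way merge against the common base version: a change made by only one editor is accepted, and where both editors changed the same line differently the later edit is kept and the line is reported as a conflict.

```python
from typing import List, Tuple


def merge_lines(base: List[str], first: List[str],
                second: List[str]) -> Tuple[List[str], List[int]]:
    """Three-way merge of two edited copies of the same page.

    All three lists are assumed to hold the same lines in the same order.
    Returns the merged lines and the indexes of conflicting lines.
    """
    merged: List[str] = []
    conflicts: List[int] = []
    for index, (original, mine, theirs) in enumerate(zip(base, first, second)):
        if mine == theirs or theirs == original:
            merged.append(mine)      # same edit, or only the first editor changed it
        elif mine == original:
            merged.append(theirs)    # only the second editor changed it
        else:
            merged.append(theirs)    # both changed it: keep the later edit
            conflicts.append(index)  # and flag the line for review
    return merged, conflicts
```

Because every modification is tracked, a flagged line could later be reverted by the project owner if the kept edit turns out to be wrong.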
Export of Results
When the quality of the transcription is sufficient, the project owner can export the final text in one of three groups of output formats:
- hOCR – can be used to generate alternative document formats, e.g. PDF or DjVu files,
- plain text/RTF – for processing in popular text editors, e.g. Microsoft Word,
- ePUB/Mobi – formats dedicated to mobile readers, which can be processed further in applications such as Sigil (https://code.google.com/p/sigil/) or Calibre (http://calibre-ebook.com/).
Adding new output formats is relatively easy and the list of supported formats will be extended in the future.
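To show why text coordinates matter for the hOCR export, the sketch below serializes lines into hOCR-style markup, in which each line is a `span` of class `ocr_line` whose `title` attribute holds its bounding box. The function names and the input shape (pairs of text and box) are assumptions, and the output is a fragment rather than a complete hOCR document.

```python
from html import escape
from typing import Iterable, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in image pixels


def line_to_hocr(text: str, bbox: Box) -> str:
    """Render one text line as an hOCR ocr_line element."""
    x0, y0, x1, y1 = bbox
    return (f"<span class='ocr_line' title='bbox {x0} {y0} {x1} {y1}'>"
            f"{escape(text)}</span>")


def page_to_hocr(image_name: str, lines: Iterable[Tuple[str, Box]]) -> str:
    """Wrap the lines of one scanned image in an hOCR ocr_page element."""
    title = 'image "%s"' % escape(image_name)
    body = "\n".join(line_to_hocr(text, bbox) for text, bbox in lines)
    return f"<div class='ocr_page' title='{title}'>\n{body}\n</div>"
```

Because the boxes travel with the text, downstream tools can place the recognised text behind the scanned image, which is what makes searchable PDF or DjVu output possible.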