The Impact Centre of Competence Demonstrator platform allows users to test a number of tools online without installing any software on their computers. These tools cover all steps in the digitisation workflow, such as image conversion, image enhancement, OCR and evaluation.
Image Conversion
GraphicsMagick
GraphicsMagick provides a robust and efficient collection of tools and libraries that support reading, writing, and manipulating images in over 88 major formats, including important formats like DPX, GIF, JPEG, JPEG-2000, PNG, PDF, PNM, and TIFF.
Source: Project GraphicsMagick.
Image Enhancement
Ocropus Binarisation and Dewarping service
Performs binarisation and dewarping using the Ocropus technology.
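In its simplest form, binarisation maps every grayscale pixel to either black or white by comparing it with a threshold. The sketch below illustrates that idea only, under stated assumptions: the image is a plain list of rows of 0-255 grey values, and the function name `binarise` and the default threshold of 128 are hypothetical. It makes no attempt to reproduce what Ocropus does.

```python
def binarise(pixels, threshold=128):
    """Map each grey value to 0 (black) or 255 (white).

    `pixels` is assumed to be a list of rows, each a list of integers
    in the range 0-255. Values below `threshold` become black, all
    others become white. This is a plain global threshold, chosen for
    illustration; it is not the method used by any particular tool.
    """
    return [
        [0 if value < threshold else 255 for value in row]
        for row in pixels
    ]


if __name__ == "__main__":
    sample = [
        [10, 200, 90],
        [128, 127, 255],
    ]
    for row in binarise(sample):
        print(row)
```

A fixed global threshold tends to work only on evenly lit scans; production tools usually estimate the threshold from the image itself or vary it locally.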
Segmentation
Fraunhofer Newspaper Segmenter & Korrektor
OCR Engines
Tesseract 3.03 OCR Service
Perform OCR on an input image file using Tesseract 3.03 technology.
Evaluation
IMPACT INL OCR Evaluation Service
Performs OCR evaluation by comparing the results with ground truth.
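Comparing OCR output with ground truth is commonly summarised as a character error rate: the edit distance between the two texts divided by the length of the ground truth. The following is a minimal sketch of that general idea, not the method of the IMPACT INL service; the function names `edit_distance` and `character_error_rate` are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between sequences a and b.

    Counts the minimum number of insertions, deletions and
    substitutions needed to turn a into b, keeping only one
    previous row of the dynamic-programming table in memory.
    """
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            substitution_cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(ocr_text, ground_truth):
    """Edit distance divided by the ground truth length."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ocr_text, ground_truth) / len(ground_truth)


if __name__ == "__main__":
    print(character_error_rate("he1lo world", "hello world"))
```

Because `edit_distance` accepts any sequences, passing `ocr_text.split()` and `ground_truth.split()` gives a word-level distance with the same code.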
Tools classified according to their purpose
Image conversion
- GraphicsMagick
GraphicsMagick provides a robust and efficient collection of tools and libraries that support reading, writing, and manipulating images in over 88 major formats, including important formats like DPX, GIF, JPEG, JPEG-2000, PNG, PDF, PNM, and TIFF.
- ImageMagick conversion to PGM
Converts an image into Portable Graymap format (PGM) using ImageMagick.
- IMPACT OpenJPEG Conversion Service
Converts JPEG2000 images to TIFF, BMP, RAW and other image file formats. The implementation is based on the OpenJPEG library.
- Kakadu
Kakadu is a (commercial) software library for the encoding and decoding of images in JPEG2000 format.<!-- - Gimp Image Conversion
GIMP is a raster graphics editor used for image retouching and editing, free-form drawing, resizing, cropping, photo-montages, converting between different image formats, and more specialized tasks.
- ImageMagick Conversion
- Exiftool
ExifTool is a free software program for reading, writing, and manipulating image, audio, and video metadata. It is platform independent, available as both a Perl library (Image::ExifTool) and command-line application. ExifTool is commonly incorporated into different types of digital workflows and supports many types of metadata including Exif, IPTC, XMP, JFIF, GeoTIFF, ICC Profile, Photoshop IRB, FlashPix, AFCP and ID3, as well as the manufacturer-specific metadata formats of many digital cameras.
-->
Image Enhancement
- Image Magick Border removal
Performs image enhancement by automatically detecting and removing black borders as well as noise regions from scanned document image files using Image Magick. - <!--
- IMPACT NCSR Border Removal Service
Performs image enhancement by automatically detecting and removing black borders as well as noise regions from scanned document image files.
- IMPACT NCSR Geometric Correction Service
Performs image enhancement by automatically correcting geometric distortions typically found in scanned document image files.
- Galfar's Lair Deskew
Straightens an image to improve the detection of structures. - <!--
- NCSR Binarisation Service
Performs image binarisation using an algorithm developed at NCSR.
- Fraunhofer IAIS mydec Deskewer
mydec is software for automatic and manual media development for cultural and media organizations. It provides metadata from the cloud, making it possible to browse, combine and distribute media content on the web. The Deskewer corrects the alignment of input images in order to improve the detection of structures and text.
- Fraunhofer IAIS mydec Color Binarize
Color Binarize separates letters from the background. Grayscale images are converted to binary. The optimal contrast for the separation can be calculated either for the entire image or for each pixel.
- Fraunhofer IAIS mydec Deshadow
The natural aging of paper can cause the contrast between paper and writing to deteriorate. Fraunhofer IAIS mydec Deshadow removes these aging effects automatically to support subsequent processing.
- Ocropus binarisation and dewarping service
Performs binarisation (converting pixels to black and white) and dewarping (perspective correction) using the Ocropus technology.
- Unpaper
unpaper is a post-processing tool for scanned sheets of paper, especially for book pages scanned from previously created photocopies. unpaper tries to remove dark edges, corrects the rotation ("deskew"), and aligns the centering of pages.
- ABBYY FineReader 11 Binarisation Service
Performs binarisation using the ABBYY FineReader 11 technology.
- Scan Tailor
Scan Tailor is an interactive post-processing tool for scanned pages. It performs operations such as page splitting, deskewing, adding/removing borders, and others. You give it raw scans, and you get pages ready to be printed or assembled into a PDF or DJVU file.
!-->
-->
Segmentation
- IMPACT ABBYY FineReader 10 PAGE Segmentation Service
Performs segmentation of an input image file using ABBYY FineReader 10 and exports the results in PAGE format.
- Fraunhofer Newspaper Segmenter & Korrektor
The Korrektor is a manual post-correction tool for automatically processed newspaper scans. By loading the result XML files into the software, it is possible to correct automatically detected layout elements, texts and other properties. The scanned documents are displayed in two separate windows to allow for a detailed inspection. Results can be edited using context menus, drag and drop and keyboard shortcuts.
OCR
OCR Training
- Cutouts
Cutouts supports the preparation of proper training material for the OCR system. Proper training material here means a set of shapes (areas) separated from the source document that together make up the font used to print that document.
OCR Engines
- Abbyy FineReader 11 SDK version
ABBYY FineReader is optical character recognition (OCR) software that provides unmatched text recognition accuracy and conversion capabilities, virtually eliminating retyping and reformatting of documents. Intuitive use and one-click automated tasks let you do more in fewer steps. Up to 190 languages are supported for text recognition, more than any other OCR software in this market.
- Abbyy FineReader 11 with Impact User dictionaries
Performs OCR using the ABBYY FineReader technology with the external dictionaries developed during the Impact project.
- IMPACT Tesseract 3.03 OCR Service
Performs OCR on an input image file using Tesseract 3.03 technology.
- Tesseract PAGE XML output v1.3
Performs OCR on an input image file using Tesseract 3.03 technology and exports the output into PAGE XML format.
- IMPACT Tesseract 4.0 OCR Service
Performs OCR on an input image file using Tesseract 4.0 technology.
- Gocr
GOCR is an OCR (Optical Character Recognition) program, developed under the GNU Public License. It converts scanned images of text back to text files.
- OCRad
GNU Ocrad is an OCR (Optical Character Recognition) program based on a feature extraction method. It reads images in pbm (bitmap), pgm (greyscale) or ppm (color) formats and produces text in byte (8-bit) or UTF-8 formats.
- Gamera OCR Module
Gamera is a Python framework for building document analysis applications.
- Cuneiform
CuneiForm is a software tool for optical character recognition. It was originally developed at Cognitive Technologies and, after a few years with no development, released as freeware on December 12, 2007. The kernel of the OCR engine was released under the open source BSD license at the beginning of April 2008.
Evaluation
- <!--
- IMPACT NCSR OCR Evaluation Service
Performs OCR evaluation by comparing text results from an OCR engine with ground truth text data.
- IMPACT UA OCR Evaluation Service
This OCR evaluation tool compares a reference text with OCR output, and can also compare the output of two different OCR engines.
- IMPACT INL Word Evaluation Service
Performs word evaluation of OCR by comparing the results in PAGE format with ground truth.
- IMPACT INL OCR Evaluation Service
Performs OCR evaluation by comparing the results with ground truth.
- TextEval
TextEval is an alternative OCR evaluation tool from PRImA research.
- Impact USAL Layout evaluation
-->
File Type and Encoding
- IMPACT Iconv Encoding Conversion Service
Performs character encoding conversion using Iconv.
- XML to text
Converts various XML formats into TXT.
- IMPACT USAL Ground Truth Normalisation Service
Performs normalisation of ground truth in PAGE format according to predefined filter rules.
- Xsltproc
Xsltproc is a command line tool for applying XSLT stylesheets to XML documents. It is part of libxslt, the XSLT C library for GNOME.
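Character encoding conversion of the kind the Iconv service performs amounts to decoding bytes from one encoding and re-encoding them in another. As a rough illustration, the sketch below does this with Python's built-in codecs rather than with Iconv itself; the function name `convert_encoding` is hypothetical.

```python
def convert_encoding(data: bytes, source: str, target: str) -> bytes:
    """Re-encode `data` from the `source` encoding to the `target` encoding.

    Decoding raises UnicodeDecodeError if the bytes are not valid in
    `source`; encoding raises UnicodeEncodeError if a character cannot
    be represented in `target`. Both errors are left to the caller.
    """
    text = data.decode(source)
    return text.encode(target)


if __name__ == "__main__":
    latin1_bytes = "café".encode("iso-8859-1")
    utf8_bytes = convert_encoding(latin1_bytes, "iso-8859-1", "utf-8")
    print(latin1_bytes, utf8_bytes)
```

On the command line, the comparable Iconv call would be `iconv -f ISO-8859-1 -t UTF-8 input.txt > output.txt`.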
Other
Storage Estimator
- IMPACT Storage Estimator
This tool will estimate the storage required for the images and OCR output files made within your digitisation workflow.
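How the IMPACT Storage Estimator calculates its result is not described here. As a back-of-the-envelope illustration only, the sketch below assumes that an uncompressed page image takes width × height × bits per pixel / 8 bytes, that compression divides this by a fixed ratio, and that OCR output adds a fixed number of bytes per page. The function name and all parameters are hypothetical.

```python
def estimate_storage_bytes(pages, width_px, height_px, bits_per_pixel=24,
                           compression_ratio=1.0, ocr_bytes_per_page=0):
    """Rough storage estimate for a digitisation run, in bytes.

    Assumptions (not taken from the IMPACT tool): every page has the
    same pixel dimensions and bit depth, compression shrinks each
    image by a constant factor, and OCR output has a constant size.
    """
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    raw_image_bytes = width_px * height_px * bits_per_pixel / 8
    bytes_per_page = raw_image_bytes / compression_ratio + ocr_bytes_per_page
    return round(pages * bytes_per_page)


if __name__ == "__main__":
    # 1,000 A4 pages at 300 dpi (2480 x 3508 pixels), 24-bit colour,
    # with an assumed 10:1 compression and 50 kB of OCR output per page.
    total = estimate_storage_bytes(1000, 2480, 3508, 24,
                                   compression_ratio=10,
                                   ocr_bytes_per_page=50_000)
    print(f"{total / 1e9:.2f} GB")
```

Real collections vary in page size and compressibility, so an estimate like this is best treated as an order-of-magnitude figure.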
Cost Estimator
- IMPACT Digitisation Cost Estimator
This tool will estimate the overall cost of undertaking a digitisation project.
Other
- IMPACT INL Named Entities Recognition Service
Performs recognition and tagging of named entities (persons, locations and organizations) in a text file.
- INL Lemmatizer
INL-developed tagger-lemmatizer for historical Dutch, where the tagger is trained on the "Letters as loot" corpus and the lemmatizer is based on the INL historical lexicon.
- Decompress
This tool decompresses a compressed file. It is useful in the execution of workflows.
- JHOVE2
The JHOVE2 project generalizes the concept of format characterization to include identification, validation, feature extraction, and policy-based assessment.
- Stanford NER
Stanford NER is a tool that can mark and extract named entities (persons, locations, organizations or even titles) from a text file. It uses a supervised learning technique, which means it has to be trained with a manually tagged training file before it is applied to other text.
- Mallet
MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.
- Catlinux
- BVC Geonames Disambiguation
Disambiguation tool for geographic locations using external repositories such as Wikidata and Geonames. Available under the MIT License.