Tools for text digitisation

More than 250 state-of-the-art tools for text digitisation.

286 results

Tools

ocropus

  • Description: OCRopus is an OCR system focusing on the use of large-scale machine learning for addressing problems in document analysis.
  • Group: text recognition
  • Type: core text recognition
  • Subtype: ner
  • License:
  • Language: n/a
  • Developer: ocropus project

open-jpeg

  • Description: The OpenJPEG library is an open-source JPEG 2000 codec written in C. It was developed to promote the use of JPEG 2000, the still-image compression standard from the Joint Photographic Experts Group (JPEG). In addition to the basic codec, various other features are under development, among them the JP2 and MJ2 (Motion JPEG 2000) file formats, an indexing tool useful for the JPIP protocol, JPWL tools for error resilience, a Java viewer for j2k images, ...
  • Group: image processing
  • Type: image processing and enhancement
  • Subtype: image enhancement
  • License:
  • Language: n/a
  • Developer:

post correction tool

  • Description: Interactive post-correction of OCRed documents
  • Group: text recognition
  • Type: postcorrection
  • Subtype: ner
  • License:
  • Language: n/a
  • Developer: Centrum für Informations- und Sprachverarbeitung (CIS), University of Munich

pyBossa

  • Description: Open-source crowd-sourcing (microtasking) platform with a focus on volunteer contribution and making it super-easy to create a crowd-sourcing app.
  • Group: Miscellaneous Utilities
  • Type: -
  • Subtype: Transcription
  • License: GPLv3
  • Language: -
  • Developer: Daniel Lombraña González (Citizen Cyberscience Centre); Rufus Pollock (Open Knowledge Foundation); David Anderson (BOINC / Berkeley (via BOSSA))

stanford ner

  • Description: Stanford NER (also known as CRFClassifier) is a Java implementation of a Named Entity Recognizer. Named Entity Recognition (NER) labels sequences of words in a text that are the names of things, such as person and company names or gene and protein names. The software provides a general (arbitrary-order) implementation of linear-chain Conditional Random Field (CRF) sequence models, coupled with well-engineered feature extractors for Named Entity Recognition. (CRF models were pioneered by Lafferty, McCallum, and Pereira (2001); see Sutton and McCallum (2006) for a better introduction.) Included with the download are good 3-class (PERSON, ORGANIZATION, LOCATION) named entity recognizers for English (in versions with and without additional distributional similarity features) and another pair of models trained on the CoNLL 2003 English training data. The distributional similarity features improve performance, but the models require considerably more memory.
  • Group: text processing
  • Type: nlp tools
  • Subtype: ner
  • License:
  • Language: n/a
  • Developer: http://nlp.stanford.edu/index.shtml

tb-transcription-desk

  • Description: MediaWiki-based environment for a distributed collaborative transcription effort.
  • Group: Miscellaneous Utilities
  • Type: -
  • Subtype: Transcription
  • License: GPLv2
  • Language: -
  • Developer: University College London

tesseract

  • Description: Tesseract is probably the most accurate open-source OCR engine available.
  • Group: text recognition
  • Type: core text recognition
  • Subtype: ner
  • License:
  • Language: n/a
  • Developer: tesseract project

tifftool

  • Description: Tifftool is a high-performance tool to clean scanned documents in preparation for onscreen display or for OCR.
  • Group: Image Processing
  • Type: Image Processing and Enhancement
  • Subtype: -
  • License: GPL v2
  • Language: -
  • Developer: Paul K. Young

typewritten ocr

  • Description: OCR prototype for recognising typewritten documents, incorporating background knowledge about the specific features of this type of document.
  • Group: text recognition
  • Type: core text recognition
  • Subtype: ner
  • License:
  • Language: n/a
  • Developer: university of salford (prima)

unpaper

  • Description: Unpaper is a post-processing tool for scanned sheets of paper, especially for book pages that have been scanned from previously created photocopies. Its main purpose is to make scanned book pages more readable on screen after conversion to PDF. Additionally, unpaper might be useful for enhancing the quality of scanned pages before performing optical character recognition (OCR).
  • Group: image processing
  • Type: image processing and enhancement
  • Subtype: ner
  • License:
  • Language: n/a
  • Developer: -


