Tools for text digitisation

More than 250 state-of-the-art tools for text digitisation.

283 results

Tools

pyBossa

  • Description: Open-source crowd-sourcing (microtasking) platform with a focus on volunteer contribution and on making it easy to create a crowd-sourcing app.
  • Group: Miscellaneous Utilities
  • Type: -
  • Subtype: Transcription
  • License: GPLv3
  • Language: -
  • Developer: Daniel Lombraña González (Citizen Cyberscience Centre); Rufus Pollock (Open Knowledge Foundation); David Anderson (BOINC / Berkeley (via BOSSA))

stanford ner

  • Description: Stanford NER (also known as CRFClassifier) is a Java implementation of a Named Entity Recognizer. Named Entity Recognition (NER) labels sequences of words in a text that are the names of things, such as person and company names or gene and protein names. The software provides a general (arbitrary-order) implementation of linear-chain Conditional Random Field (CRF) sequence models, coupled with well-engineered feature extractors for Named Entity Recognition. (CRF models were pioneered by Lafferty, McCallum, and Pereira (2001); see Sutton and McCallum (2006) for a better introduction.) Included with the download are good 3-class (PERSON, ORGANIZATION, LOCATION) named entity recognizers for English, in versions with and without additional distributional similarity features, and another pair of models trained on the CoNLL 2003 English training data. The distributional similarity features improve performance, but the models require considerably more memory.
  • Group: text processing
  • Type: nlp tools
  • Subtype: ner
  • License:
  • Language: n/a
  • Developer: http://nlp.stanford.edu/index.shtml

tb-transcription-desk

  • Description: MediaWiki-based environment for a distributed, collaborative transcription effort.
  • Group: Miscellaneous Utilities
  • Type: -
  • Subtype: Transcription
  • License: GPLv2
  • Language: -
  • Developer: University College London

tesseract

  • Description: Tesseract is probably the most accurate open-source OCR engine available.
  • Group: text recognition
  • Type: core text recognition
  • Subtype: ner
  • License:
  • Language: n/a
  • Developer: tesseract project

tifftool

  • Description: Tifftool is a high-performance tool for cleaning scanned documents in preparation for on-screen display or OCR.
  • Group: Image Processing
  • Type: Image Processing and Enhancement
  • Subtype: -
  • License: GPL v2
  • Language: -
  • Developer: Paul K. Young

typewritten ocr

  • Description: OCR prototype for recognising typewritten documents, incorporating background knowledge about the specific features of this type of document.
  • Group: text recognition
  • Type: core text recognition
  • Subtype: ner
  • License:
  • Language: n/a
  • Developer: University of Salford (PRImA)

unpaper

  • Description: Unpaper is a post-processing tool for scanned sheets of paper, especially for book pages that have been scanned from previously created photocopies. Its main purpose is to make scanned book pages more readable on screen after conversion to PDF. Unpaper may also be useful for enhancing the quality of scanned pages before optical character recognition (OCR) is performed.
  • Group: image processing
  • Type: image processing and enhancement
  • Subtype: ner
  • License:
  • Language: n/a
  • Developer: -

