Tools for text digitisation

More than 250 state-of-the-art tools for text digitisation.

35 results in

Tools

korrektor

  • Description: GUI-based software for viewing and correcting document analysis results
  • Group: text recognition
  • Type: postcorrection
  • Subtype: ner
  • License:
  • Language: n/a
  • Developer: fraunhofer iais

layout evaluation

  • Description: Performance evaluation tool for layout analysis and segmentation methods, based on detailed metrics (types of errors such as merges, splits, and missed regions) and use scenarios
  • Group: evaluation
  • Type: layout
  • Subtype: ner
  • License:
  • Language: n/a
  • Developer: university of salford (prima)
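
To make the error types named in the description concrete, here is a minimal sketch that classifies merges, splits, and missed regions by comparing ground-truth rectangles with detected ones. It is not the PRImA tool's method or metric set: the rectangle format (x0, y0, x1, y1), the function names, and the "any overlap counts" rule are all assumptions chosen for illustration.

```python
def overlap_area(a, b):
    """Area shared by two rectangles given as (x0, y0, x1, y1)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0) * max(height, 0)


def classify_errors(ground_truth, detected):
    """Return indices of missed/split ground-truth regions and merged detections.

    Illustrative rule (an assumption): any non-zero overlap counts as a match.
    """
    report = {"missed": [], "split": [], "merged": []}

    for i, gt in enumerate(ground_truth):
        hits = [j for j, det in enumerate(detected) if overlap_area(gt, det) > 0]
        if not hits:
            report["missed"].append(i)   # no detection covers this region
        elif len(hits) > 1:
            report["split"].append(i)    # region broken into several detections

    for j, det in enumerate(detected):
        hits = [i for i, gt in enumerate(ground_truth) if overlap_area(gt, det) > 0]
        if len(hits) > 1:
            report["merged"].append(j)   # one detection swallows several regions

    return report
```

A real evaluation would normally weight errors by area and region type instead of counting them, which is where the "use scenarios" mentioned in the description would come in.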

line and word segmentation

  • Description: Segmentation of text regions into text lines and words independent of text recognition (OCR).
  • Group: image processing
  • Type: image segmentation
  • Subtype: ner
  • License:
  • Language: n/a
  • Developer: university of salford (prima)
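
As a rough illustration of segmenting lines without any text recognition, the sketch below uses a horizontal projection profile, a classic textbook technique. Whether the tool listed here works this way is not stated in its description; the binary-image representation (a list of rows of 0/1 pixels) and the function name are assumptions.

```python
def text_line_bands(image):
    """Find vertical bands of consecutive rows that contain ink.

    `image` is a list of rows, each a list of 0 (background) or 1 (ink).
    Returns (top, bottom) pairs with `bottom` exclusive.
    """
    bands = []
    start = None
    for y, row in enumerate(image):
        has_ink = any(row)
        if has_ink and start is None:
            start = y                    # a new text line begins
        elif not has_ink and start is not None:
            bands.append((start, y))     # an empty row closes the line
            start = None
    if start is not None:
        bands.append((start, len(image)))  # line runs to the bottom edge
    return bands
```

The same idea applied column-wise inside each band gives candidate word boundaries, although skewed or noisy scans usually need more robust methods.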

nert

  • Description: NERT is a tool that can mark and extract named entities (persons, locations, and organizations) from a text file. It uses a supervised learning technique, which means it has to be trained on a manually tagged training file before it is applied to other text. In addition, version 2.0 and higher of the tool comes with a named entity matcher module, which can group variants or assign modern word forms of named entities to old spelling variants. The tool in this package is based on the named entity recognizer from Stanford University, which has been extended for use in IMPACT. The extensions include the aforementioned matcher module and a module that reduces spelling variation within the data, thus improving performance.
  • Group: text processing
  • Type: nlp tools
  • Subtype: ner
  • License:
  • Language: n/a
  • Developer: info.inl.nl
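
The idea of grouping spelling variants of a named entity can be sketched as follows. This is not NERT's matcher: the rewrite rules, the function names, and the example place names are hypothetical and hand-picked, standing in for whatever variation model the real module uses.

```python
import unicodedata
from collections import defaultdict

# Hand-picked rewrite rules (assumptions for illustration only).
REWRITE_RULES = (("ae", "a"), ("ck", "k"), ("y", "i"), ("gh", "g"))


def normalise(name):
    """Reduce a name to a crude matching key."""
    # Strip diacritics and lowercase.
    decomposed = unicodedata.normalize("NFKD", name)
    key = "".join(c for c in decomposed if not unicodedata.combining(c)).lower()
    for old, new in REWRITE_RULES:
        key = key.replace(old, new)
    # Collapse doubled letters, e.g. "aa" -> "a".
    collapsed = []
    for char in key:
        if not collapsed or collapsed[-1] != char:
            collapsed.append(char)
    return "".join(collapsed)


def group_variants(names):
    """Group names that share the same normalised key, keeping input order."""
    groups = defaultdict(list)
    for name in names:
        groups[normalise(name)].append(name)
    return dict(groups)
```

For example, `group_variants(["Haarlem", "Haerlem", "Leyden", "Harlem", "Leiden"])` puts the three Haarlem spellings under one key and the two Leiden spellings under another. Rules this blunt will also conflate unrelated names, so a practical matcher needs a more careful similarity measure.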

ocrad

  • Description: GNU Ocrad is an OCR (Optical Character Recognition) program based on a feature extraction method. It reads images in pbm (bitmap), pgm (greyscale), or ppm (color) formats and produces text in byte (8-bit) or UTF-8 formats. It also includes a layout analyser able to separate the columns or blocks of text normally found on printed pages. Ocrad can be used as a stand-alone console application or as a backend to other programs.
  • Group: text recognition
  • Type: core text recognition
  • Subtype: ner
  • License:
  • Language: n/a
  • Developer: -
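
Using Ocrad as a backend to another program could look roughly like the sketch below, which builds a command line and runs it with Python's subprocess module. The helper names are hypothetical, and the `--format` option is written from memory of the Ocrad manual; check `ocrad --help` on your installation before relying on it.

```python
import shutil
import subprocess
from pathlib import Path

SUPPORTED_SUFFIXES = {".pbm", ".pgm", ".ppm"}   # input formats named in the description
OUTPUT_FORMATS = {"byte", "utf8"}               # output formats named in the description


def build_ocrad_command(image_path, output_format="utf8"):
    """Return the argument list for an ocrad call, validating inputs first."""
    suffix = Path(image_path).suffix.lower()
    if suffix not in SUPPORTED_SUFFIXES:
        raise ValueError(f"unsupported image type: {suffix!r}")
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"unknown output format: {output_format!r}")
    # Assumed flag: --format=<byte|utf8>; recognised text goes to stdout.
    return ["ocrad", f"--format={output_format}", str(image_path)]


def run_ocrad(image_path, output_format="utf8"):
    """Run ocrad and return its raw output bytes."""
    if shutil.which("ocrad") is None:
        raise FileNotFoundError("ocrad executable not found on PATH")
    command = build_ocrad_command(image_path, output_format)
    result = subprocess.run(command, capture_output=True, check=True)
    return result.stdout
```

Scans in other formats (PNG, TIFF, JPEG) would first need converting to one of the supported formats with a separate tool.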

ocropus

  • Description: OCRopus is an OCR system focusing on the use of large-scale machine learning to address problems in document analysis.
  • Group: text recognition
  • Type: core text recognition
  • Subtype: ner
  • License:
  • Language: n/a
  • Developer: ocropus project

post correction tool

  • Description: Interactive post-correction of OCRed documents
  • Group: text recognition
  • Type: postcorrection
  • Subtype: ner
  • License:
  • Language: n/a
  • Developer: centrum für informations- und sprachverarbeitung (cis), university of munich

stanford ner

  • Description: Stanford NER (also known as CRFClassifier) is a Java implementation of a Named Entity Recognizer. Named Entity Recognition (NER) labels sequences of words in a text which are the names of things, such as person and company names or gene and protein names. The software provides a general (arbitrary order) implementation of linear chain Conditional Random Field (CRF) sequence models, coupled with well-engineered feature extractors for Named Entity Recognition. (CRF models were pioneered by Lafferty, McCallum, and Pereira (2001); see Sutton and McCallum (2006) for a better introduction.) Included with the download are good 3-class (PERSON, ORGANIZATION, LOCATION) named entity recognizers for English (in versions with and without additional distributional similarity features) and another pair of models trained on the CoNLL 2003 English training data. The distributional similarity features improve performance, but the models require considerably more memory.
  • Group: text processing
  • Type: nlp tools
  • Subtype: ner
  • License:
  • Language: n/a
  • Developer: http://nlp.stanford.edu/index.shtml
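
Since the recognizer labels sequences of words, downstream code usually has to turn per-token labels back into whole entities. The sketch below shows one way to do that in Python. It does not call Stanford NER: the input is assumed to be a list of (token, label) pairs using the 3-class labels plus an "O" label for non-entity tokens, and the function name is hypothetical.

```python
from itertools import groupby


def entity_spans(tagged_tokens, outside="O"):
    """Merge runs of adjacent tokens that carry the same entity label.

    `tagged_tokens` is a list of (token, label) pairs; tokens labelled
    `outside` are skipped.
    """
    spans = []
    for label, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if label != outside:
            text = " ".join(token for token, _ in run)
            spans.append((text, label))
    return spans
```

Because the labels carry no begin/inside markers in this assumed format, two distinct entities of the same class that stand directly next to each other would be merged into one span.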

tesseract

  • Description: Tesseract is probably the most accurate open source OCR engine available
  • Group: text recognition
  • Type: core text recognition
  • Subtype: ner
  • License:
  • Language: n/a
  • Developer: tesseract project

typewritten ocr

  • Description: OCR prototype for recognising typewritten documents, incorporating background knowledge about the specific features of this type of document.
  • Group: text recognition
  • Type: core text recognition
  • Subtype: ner
  • License:
  • Language: n/a
  • Developer: university of salford (prima)

unpaper

  • Description: Unpaper is a post-processing tool for scanned sheets of paper, especially for book pages that have been scanned from previously created photocopies. Its main purpose is to make scanned book pages more readable on screen after conversion to PDF. Additionally, unpaper might be useful for enhancing the quality of scanned pages before performing optical character recognition (OCR).
  • Group: image processing
  • Type: image processing and enhancement
  • Subtype: ner
  • License:
  • Language: n/a
  • Developer: -

