Tools for text digitisation

More than 250 state-of-the-art tools for text digitisation.

286 results

Tools

ImpacTok Tokenizer

  • Description: The tokenizer pre-processes the documents that form the corpus used to build the lexicon. Tokenization is the process of breaking down a stream of text into words or tokens. This tokenizer is based on ILKTOK, part of the ‘Tadpole’ language processing suite (ilk.uvt.nl/software/). The code was rewritten to produce the output required by the database used for the IMPACT Lexicon and to introduce a more modular approach.
  • Group: text processing
  • Type: nlp tools
  • Subtype: tokenizer
  • License:
  • Language: n/a
  • Developer: IVdNT
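
To illustrate what tokenization means in the description above, here is a minimal sketch in Python. It is not the ImpacTok or ILKTOK implementation: the regular expression and the `tokenize` function are assumptions chosen for illustration, and a production tokenizer for historical text would need language-specific rules for abbreviations, hyphenation, and spelling variation.

```python
import re

# Hypothetical pattern (an assumption, not ImpacTok's rules):
# a run of word characters, optionally joined by internal apostrophes
# or hyphens, or else any single non-space, non-word symbol.
TOKEN_PATTERN = re.compile(r"\w+(?:['’-]\w+)*|[^\w\s]")


def tokenize(text: str) -> list[str]:
    """Split a stream of text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


if __name__ == "__main__":
    for token in tokenize("A well-known book, isn't it?"):
        print(token)
```

Keeping punctuation as separate tokens, rather than discarding it, lets a later step decide whether it matters for lexicon building.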

Impact Bulgarian Demonstrator Dataset

  • Description: The Bulgarian ground truth, produced by the National Library of Bulgaria (NLB) within the EU-funded Impact project, consists of 1,276 pages in PAGE XML format with an accuracy of 99.95%.
  • Group: Data
  • Type: Groundtruth
  • Subtype: -
  • License: CC-BY-NC-ND
  • Language: Bulgarian
  • Developer: National Library of Bulgaria

Impact Bulgarian Historical Lexicon

  • Description: The current lexicon consists of 28857 lexical entries developed in LeXtractor. The historical lexicon extracted from the manually validated corpus tokens has the following size: 26148 word forms, 25861 normalised, 21115 modernised, and 11090 lemmata. The lexicon is currently available as LeXtractor and TEI P5 XML and in the IMPACT database structure.
  • Group: Data
  • Type: Language resources
  • Subtype: Historical lexicon
  • License: CC-BY-NC-SA
  • Language: Bulgarian
  • Developer: Bulgarian Academy of Sciences

Impact Bulgarian Institutional Dataset

  • Description: The image collection for the Bulgarian language is provided by the National Library of Bulgaria. The dataset consists of 4,240 high-resolution images.
  • Group: Data
  • Type: Images
  • Subtype: -
  • License: CC-BY-NC-ND
  • Language: Bulgarian
  • Developer: National Library of Bulgaria

Impact Czech Demonstrator Dataset

  • Description: The Czech ground truth, produced by Národní knihovna České republiky (National Library of the Czech Republic, NKC) within the EU-funded Impact project, consists of 5,049 pages in PAGE XML format with an accuracy of 99.95%.
  • Group: Data
  • Type: Groundtruth
  • Subtype: -
  • License: CC-BY-NC-SA
  • Language: Czech
  • Developer: National Library of Czech Republic

Impact Czech Historical Lexicon

  • Description: The Historical Lexicon of Czech covers the period between 1800 and 1900. The current lexicon is divided into the following periods:
      - 1801-1809: 16052 lemmata, 311362 word forms, and 321099 lemma/word form combinations.
      - 1810-1842: 16056 lemmata, 297122 word forms, and 304711 lemma/word form combinations.
      - 1843-1849: 9406 lemmata, 178783 word forms, and 183079 lemma/word form combinations.
      - 1850+: 31954 lemmata, 506663 word forms, and 518628 lemma/word form combinations.
  • Group: Data
  • Type: Language resources
  • Subtype: Historical lexicon
  • License: CC-BY-NC-ND
  • Language: Czech
  • Developer: Charles University Prague

Impact Czech Institutional Dataset

  • Description: The image collection for the Czech language is provided by the National Library of the Czech Republic. The dataset consists of 75,559 high-resolution images.
  • Group: Data
  • Type: Images
  • Subtype: -
  • License: CC-BY-NC-SA
  • Language: Czech
  • Developer: National Library of Czech Republic

Impact Dutch Named Entities Lexica

  • Description: The Core Named Entities Lexicon for Dutch is an elaborate database of enriched historical Dutch locations, person names, and organisations from the period 1750-1945. It can be used as a lexicon for OCR and for query expansion in retrieval.
  • Group: Data
  • Type: Language resources
  • Subtype: Named entities lexica
  • License: pending
  • Language: Dutch
  • Developer: Instituut voor Nederlandse Lexicologie

Impact Dutch Demonstrator Dataset

  • Description: The Dutch ground truth, produced by the Koninklijke Bibliotheek (KB) within the EU-funded Impact project, consists of 3,439 pages in PAGE XML format with an accuracy of 99.95%.
  • Group: Data
  • Type: Groundtruth
  • Subtype: -
  • License: No restriction (except newspapers: no usage allowed)
  • Language: Dutch
  • Developer: National Library of the Netherlands

Impact Dutch Historical Lexicon

  • Description: The Historical Lexicon of Dutch covers the period from 1600 to 1940, and the material used consists of books, newspapers, and parliamentary papers. The Dutch IR lexicon has been built by means of the IMPACT dictionary attestation tool from the quotations of the WNT (Dictionary of the Dutch Language). The lexicon currently contains 475498 distinct word forms, 215180 lemmata, and 558438 distinct lemma/word form combinations, with 1636709 attestations.
  • Group: Data
  • Type: Language resources
  • Subtype: Historical lexicon
  • License: pending
  • Language: Dutch
  • Developer: Instituut voor Nederlandse Lexicologie

Impact Dutch Institutional Dataset

  • Description: The image collection for the Dutch language is provided by the National Library of the Netherlands. The dataset consists of 88,192 high-resolution images.
  • Group: Data
  • Type: Images
  • Subtype: -
  • License: No restriction (except newspapers: no usage allowed)
  • Language: Dutch
  • Developer: National Library of the Netherlands

Impact English Demonstrator Dataset

  • Description: The English ground truth, produced by the British Library (BL) within the EU-funded Impact project, consists of 2,775 pages in PAGE XML format with an accuracy of 99.95%.
  • Group: Data
  • Type: Groundtruth
  • Subtype: -
  • License: pending
  • Language: English
  • Developer: The British Library

