Tools for text digitisation

More than 250 state-of-the-art tools for text digitisation.

286 results

Tools

Impact Polish Demonstrator Dataset

  • Description: The Polish ground truth, produced by Poznań Supercomputing and Networking Center (PSNC) within the EU-funded IMPACT project, consists of 4,693 pages in PAGE XML format with an accuracy of 99.95%.
  • Group: Data
  • Type: Groundtruth
  • Subtype: -
  • License: CC-BY
  • Language: Polish
  • Developer: Poznań Supercomputing and Networking Center

Impact Polish Historical Lexicon

  • Description: The primary resource was the Internet dictionary referred to here as the "Late Middle Polish dictionary"; its official name is "The dictionary of the Polish language of the sixteenth and the first half of the seventeenth century". The current lexicon consists of 9909 lemmata, 24977 word forms and 26736 lemma/word form combinations. A set of more than 100 rules for the historical spelling of Polish, developed for the IMPACT project, is also available.
  • Group: Data
  • Type: Language resources
  • Subtype: Historical lexicon
  • License: pending
  • Language: Polish
  • Developer: University of Warsaw
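
The three lexicon figures above (lemmata, word forms, lemma/word form combinations) can be related with a small sketch. It assumes the lexicon is available as (lemma, word form) pairs; the function name `lexicon_counts` and the sample pairs are illustrative assumptions, not part of the IMPACT distribution.

```python
def lexicon_counts(pairs):
    """Count distinct lemmata, word forms and lemma/word form combinations.

    `pairs` is an iterable of (lemma, word_form) tuples. This layout is an
    assumption made for illustration; the real lexicon format may differ.
    """
    unique_pairs = set(pairs)  # duplicate attestations count once
    lemmata = {lemma for lemma, _ in unique_pairs}
    word_forms = {form for _, form in unique_pairs}
    return {
        "lemmata": len(lemmata),
        "word_forms": len(word_forms),
        "combinations": len(unique_pairs),
    }


# Illustrative pairs only: the form "ma" is listed under two lemmata,
# so there are more combinations than distinct word forms.
sample = [
    ("być", "jest"),
    ("być", "są"),
    ("być", "jest"),
    ("mieć", "ma"),
    ("mój", "ma"),
]
counts = lexicon_counts(sample)
```

A word form shared by several lemmata is counted once as a form but once per lemma as a combination, which is why the combination count can exceed the word form count.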

Impact Polish Institutional Dataset

  • Description: The image collection for the Polish language is provided by Poznań Supercomputing and Networking Center. The dataset consists of 11,020 high-resolution images.
  • Group: Data
  • Type: Images
  • Subtype: -
  • License: CC-BY
  • Language: Polish
  • Developer: Poznań Supercomputing and Networking Center

Impact Slovene Demonstrator Dataset

  • Description: The Slovene ground truth, produced by the National and University Library of Slovenia (NUK) within the EU-funded IMPACT project, consists of 4,937 pages in PAGE XML format with an accuracy of 99.95%.
  • Group: Data
  • Type: Groundtruth
  • Subtype: -
  • License: CC-BY-NC-SA
  • Language: Slovene
  • Developer: National and University Library of Slovenia

Impact Slovene Historical Lexicon

  • Description: Apart from about 40 pages from a sixteenth-century and a seventeenth-century book, the dataset for historical Slovene contains material published from the second half of the eighteenth century to the end of the nineteenth century. The material consists of books and one daily newspaper. The current lexicon consists of the initial 3000 lexical entries developed in LeXtractor and the lexicon that can be automatically extracted from the manually validated tokens of the reference corpus. At the time of writing, the lexica extracted from the manually validated corpus tokens comprised 16245 lexical entries, 15715 word forms, 14249 normalized forms, 11396 modernized forms and 6789 lemmata.
  • Group: Data
  • Type: Language resources
  • Subtype: Historical lexicon
  • License: CC-BY
  • Language: Slovene
  • Developer: Jožef Stefan Institute

Impact Slovene Institutional Dataset

  • Description: The image collection for the Slovene language is provided by the National and University Library of Slovenia. The dataset consists of 41,313 high-resolution images.
  • Group: Data
  • Type: Images
  • Subtype: -
  • License: CC-BY-NC-SA
  • Language: Slovene
  • Developer: National and University Library of Slovenia

Impact Spanish Demonstrator Dataset

  • Description: The Spanish ground truth, produced by Universidad de Alicante (UA) within the EU-funded IMPACT project, consists of 11,444 pages in PAGE XML format with an accuracy of 99.95%.
  • Group: Data
  • Type: Groundtruth
  • Subtype: -
  • License: CC-BY-NC-SA
  • Language: Spanish
  • Developer: National Library of Spain

Impact Spanish Historical Lexicon

  • Description: Fourteen works of Spanish literature and a dictionary (consisting of 6 volumes) were selected for the IMPACT Demonstrator dataset. Most books are from the sixteenth or seventeenth century, known as the Spanish Golden Age. They are mostly literary works: religious plays, novels and poetry. Just one book belongs to the eighteenth century, as does the Diccionario de Autoridades. Two of these books are from America: Cartha Athenagorica by Sor Juana Inés de la Cruz and Commentarios reales by Inca Garcilaso de la Vega; they were selected in order to register the vocabulary of Spanish in Latin America. Apart from these books, 86 works dating from the late 15th century to the 17th century were selected from Biblioteca Virtual Miguel de Cervantes, comprising almost 2 million tokens and 90,000 word forms. The current lexicon consists of 11846 lemmata, 31584 word forms and 36857 lemma/word form combinations.
  • Group: Data
  • Type: Language resources
  • Subtype: Historical lexicon
  • License: CC-BY-NC-SA
  • Language: Spanish
  • Developer: University of Alicante

Impact Spanish Institutional Dataset

  • Description: The image collection for the Spanish language is provided by Biblioteca Nacional de España. The dataset consists of 60,180 high-resolution images.
  • Group: Data
  • Type: Images
  • Subtype: -
  • License: CC-BY-NC-ND
  • Language: Spanish
  • Developer: National Library of Spain

Impact Tools

  • Description: The spelling of words in historical texts can differ widely from modern spelling. There are two general approaches to matching different spellings. First, it is possible to use rewrite rules that transform words from one spelling to another. For a historical dictionary that covers a large timespan, and in which variation is not limited to orthography, this approach is not satisfactory. Therefore, the use of statistics is often needed.
  • Group: Text Processing
  • Type: NLP Tools
  • Subtype: Spelling variations
  • License:
  • Language: -
  • Developer: http://www.inl.nl/home

Impact Tools - Lemmatization

  • Description: IMPACT provides tools for: 1. Reducing historical word forms to one or several possible modern lemmas (lemmatization). 2. Expanding lemma lists with part-of-speech information to possible ("hypothetical") full forms.
  • Group: Text Processing
  • Type: NLP Tools
  • Subtype: Lemmatization
  • License: ASL 2.0
  • Language: -
  • Developer: http://www.inl.nl/home
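
The two tasks in the description above can be illustrated with a minimal sketch. It is not the IMPACT implementation: the lexicon entries, the suffix table `PARADIGMS` and the function names are all hypothetical, and expansion is reduced to naive suffix concatenation.

```python
from collections import defaultdict

# Hypothetical (historical word form, modern lemma) pairs for illustration.
ENTRIES = [
    ("huys", "huis"),
    ("gheven", "geven"),
    ("was", "zijn"),
    ("was", "was"),
]

# Hypothetical suffix lists per part of speech (not a real paradigm table).
PARADIGMS = {
    "NOUN": ["", "en", "s"],
    "VERB": ["", "t", "de"],
}


def build_index(entries):
    """Map each lower-cased word form to the set of lemmata it may belong to."""
    index = defaultdict(set)
    for form, lemma in entries:
        index[form.lower()].add(lemma)
    return index


def lemmatize(form, index):
    """Return all candidate modern lemmata for a historical word form."""
    return sorted(index.get(form.lower(), set()))


def expand(lemma, pos, paradigms=PARADIGMS):
    """Generate hypothetical full forms by appending each suffix to the lemma."""
    return [lemma + suffix for suffix in paradigms.get(pos, [""])]


index = build_index(ENTRIES)
```

With these sample entries, `lemmatize("Was", index)` returns two candidates because the form is ambiguous, and `expand("boek", "NOUN")` produces forms that are only hypothetical: some of them will not be attested in any text.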

Impact Tools - Spelling variations

  • Description: The spelling of words in historical texts can differ widely from modern spelling. There are two general approaches to matching different spellings. First, it is possible to use rewrite rules that transform words from one spelling to another. For a historical dictionary that covers a large timespan, and in which variation is not limited to orthography, this approach is not satisfactory. Therefore, the use of statistics is often needed.
  • Group: Text Processing
  • Type: NLP Tools
  • Subtype: Spelling variations
  • License: ASL 2.0
  • Language: -
  • Developer: http://www.inl.nl/home
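
The rewrite-rule approach described above can be sketched as follows. The rules, the sample lexicon and the function names are assumptions made for illustration and are not taken from the IMPACT tools. The fallback uses string similarity from Python's `difflib` as a simple stand-in for the statistical matching the description mentions; it is not the method the tools actually use.

```python
import difflib
import re

# Hypothetical rewrite rules: (historical pattern, modern replacement).
REWRITE_RULES = [
    (r"y", "i"),
    (r"ck", "k"),
    (r"ae", "a"),
]


def modern_candidates(word, rules=REWRITE_RULES):
    """Return the spellings reachable by applying each rule or skipping it."""
    candidates = {word}
    for pattern, replacement in rules:
        candidates |= {re.sub(pattern, replacement, c) for c in candidates}
    return candidates


def match_lexicon(word, lexicon, rules=REWRITE_RULES):
    """Return the modern lexicon entries produced by the rewrite rules."""
    return sorted(modern_candidates(word, rules) & set(lexicon))


def match_with_fallback(word, lexicon, cutoff=0.8):
    """Try the rewrite rules first, then fall back to string similarity."""
    exact = match_lexicon(word, lexicon)
    if exact:
        return exact
    return difflib.get_close_matches(word, sorted(lexicon), n=3, cutoff=cutoff)


# Illustrative modern lexicon.
lexicon = {"klein", "maken"}
```

Here `match_lexicon("maecken", lexicon)` finds "maken" only because two rules apply together, which shows how quickly hand-written rule sets grow when variation spans a long period.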

