Language resources

The Impact Centre of Competence provides historical and named-entity lexica for the languages listed below. In addition, we offer access to several corpora and corpus search services.

Historical and named entities lexica

Bulgarian lexica

Bulgarian lexicon

The current lexicon consists of 28,857 lexical entries developed in LeXtractor.

Bulgarian
Czech

Czech lexicon

The period covered by the Historical Lexicon of Czech is between 1800 and 1900.

Czech
Dutch

Dutch lexicon

The Historical Lexicon of Dutch covers the period from 1600 to 1940.

Dutch
English

English lexicon

The Historical Lexicon of English covers the period from 1497 to 1900.

English
French

French lexicon

The Historical Lexicon of French focuses on 17th-century, late Renaissance French.

French
German

German lexicon

The German historical corpus consists of 510 texts varying in length and including different genres.

German
Polish

Polish lexicon

The ground truth material for Polish consists of books published from 1617 to 1756.

Polish
Slovene

Slovene lexicon

The dataset contains material published from the second half of the eighteenth century to the end of the nineteenth century.

Slovene
Spanish

Spanish lexicon

Fourteen works of Spanish Literature and a dictionary were selected for the IMPACT Demonstrator dataset.

Spanish
Latin

Latin lexicon

Produced by Universidad de Alicante.

Latin

Corpora

IMPACT-es diachronic corpus

IMPACT-es diachronic corpus

The IMPACT-es diachronic corpus of historical Spanish compiles over one hundred books. It is accompanied by a complementary lexicon that links more than 10 thousand lemmas.

IMPACT-es CORPORA
IMP Slovene Corpora

IMP Slovene Corpora

The reference corpus of historical Slovene goo300k contains the text from 1,100 pages sampled from the IMP collection with hand-validated linguistic annotation.

IMP Slovene Corpora

Corpora search services

Diasearch - Diachronic corpus search service

Diasearch – Diachronic corpus search service

Diasearch is an online service which enables users to perform linguistically enriched queries on a collection of historical texts. It is currently available for the Spanish IMPACT-es corpus.

Diasearch service
Golden Age sonnet search service
 

Golden Age sonnet search service

The Golden Age sonnet search service is developed for the exploitation of a
TEI-based Spanish poetry corpus which compiles 5078 sonnets written during
the 16th and 17th centuries.

Golden Age sonnet search service
Polish GT Corpora
 
 

IMPACT Polish GT Corpora

The search engine, made available by the Formal Linguistics Department of the University of Warsaw, facilitates searching digitized texts in the DjVu format.

IMPACT Polish GT Corpora

What is a lexicon?

A lexicon is a structured, machine-usable repository of relevant linguistic knowledge about words in a language. A lexicon will contain historical variants (orthographical variants, inflected forms) and link them to a corresponding dictionary form in modern spelling (known as a ‘modern lemma’). In this way, a user can search for a modern word (‘water’) and receive results that take into account all historical variants in that language (‘wæter’, ‘weter’, ‘waterr’, ‘watre’, etc.).
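The idea of linking historical variants to a modern lemma can be illustrated with a small Python sketch. This is not the format or API of the IMPACT lexica; the data layout and the names `ENTRIES`, `build_index`, `expand_query` and `search` are hypothetical, chosen only to show how a query for a modern word might be expanded to its historical forms.

```python
from collections import defaultdict

# Hypothetical toy data: (historical variant, modern lemma) pairs.
# The 'water' variants are the ones quoted in the text above; a real
# lexicon would hold many more entries with richer metadata.
ENTRIES = [
    ("wæter", "water"),
    ("weter", "water"),
    ("waterr", "water"),
    ("watre", "water"),
]


def build_index(entries):
    """Map each modern lemma to the set of its known historical variants."""
    index = defaultdict(set)
    for variant, lemma in entries:
        index[lemma].add(variant)
    return index


def expand_query(index, modern_word):
    """Return the modern word together with all of its known variants."""
    return {modern_word} | index.get(modern_word, set())


def search(index, modern_word, documents):
    """Return the documents that contain the modern word or any variant.

    Tokenisation is a plain whitespace split, which is an assumption made
    to keep the sketch short; punctuation is not handled.
    """
    forms = expand_query(index, modern_word)
    return [doc for doc in documents if forms & set(doc.lower().split())]


index = build_index(ENTRIES)
documents = ["the wæter was cold", "nothing relevant here"]
print(search(index, "water", documents))
```

With this layout, a search for ‘water’ also matches a text that only contains ‘wæter’, which is the behaviour described above. A production lexicon would additionally need to deal with inflected forms, ambiguous variants that map to more than one lemma, and proper tokenisation.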

Download resources

These resources are available for download to members of the Impact Centre of Competence and registered users.