Dutch Lexicon

ABSTRACT

The Historical Lexicon of Dutch covers the period from 1600 to 1940. The source material consists of books, newspapers and parliamentary papers.

The Dutch IR lexicon has been built by means of the IMPACT dictionary attestation tool from the quotations of the WNT (Dictionary of the Dutch language). The lexicon currently contains 475,498 distinct word forms, 215,180 lemmata, and 558,438 distinct lemma/word form combinations, with 1,636,709 attestations. The OCR lexicon is corpus-based, using the large historical corpus from the DBNL (Digitale Bibliotheek voor de Nederlandse Letteren, Digital Library for Dutch Literature). An alternative OCR lexicon based on the contents of the IR lexicon is also available. A reference set of 10,000 pairs (modern word plus historical word), taken from the WNT-based lexicon, has been built for the inventory of historical spelling variation. Based on this, a set of spelling variation rules has been constructed.
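As a rough illustration of how lemma/word form combinations like those described above might support retrieval, the sketch below expands a query term to every word form that shares a lemma with it. The pairs, the in-memory layout and the function names (`build_index`, `expand_query`) are assumptions made for this example. They do not reflect the lexicon's actual distribution format or any IMPACT tool API.

```python
from collections import defaultdict

# Illustrative (lemma, word form) pairs. These are hypothetical sample
# entries, not data copied from the actual lexicon.
PAIRS = [
    ("mens", "mens"),
    ("mens", "mensch"),
    ("mens", "menschen"),
    ("vis", "vis"),
    ("vis", "visch"),
]


def build_index(pairs):
    """Build two lookup tables: lemma -> word forms and word form -> lemmata."""
    lemma_to_forms = defaultdict(set)
    form_to_lemmata = defaultdict(set)
    for lemma, form in pairs:
        lemma_to_forms[lemma].add(form)
        form_to_lemmata[form].add(lemma)
    return lemma_to_forms, form_to_lemmata


def expand_query(term, lemma_to_forms, form_to_lemmata):
    """Return the term plus every word form sharing a lemma with it, sorted."""
    expanded = {term}
    for lemma in form_to_lemmata.get(term, set()):
        expanded |= lemma_to_forms.get(lemma, set())
    return sorted(expanded)


lemma_to_forms, form_to_lemmata = build_index(PAIRS)
print(expand_query("mensch", lemma_to_forms, form_to_lemmata))
```

A term that does not occur in the index is returned unchanged, so the expansion step never drops the user's original query word.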

The Core Named Entities Lexicon for Dutch is an elaborate database of enriched historical Dutch locations, person names and organisations from the period 1750 – 1945. It can be used as a lexicon for OCR and for query expansion in retrieval.

PRODUCED BY

Instituut voor de Nederlandse Taal

LICENSING

The lexicon built for Dutch during the IMPACT project is available to the research community under the regulations of the Dutch HLT agency (TST-Centrale, www.inl.nl), which means that it is freely available for non-commercial use.