Slovene Lexicon


Apart from about 40 pages taken from one sixteenth-century and one seventeenth-century book, the dataset for historical Slovene contains material published between the second half of the eighteenth century and the end of the nineteenth century. The material consists of books and one daily newspaper.

The current lexicon consists of the initial 3,000 lexical entries developed in LeXtractor, together with the lexicon that can be automatically extracted from the manually validated tokens of the reference corpus. At the time of writing, the lexicon extracted from the manually validated corpus tokens comprised 16,245 lexical entries, 15,715 word forms, 14,249 normalized forms, 11,396 modernized forms and 6,789 lemmata.

The lexicon is available in two formats. The first is a simple tabular file; apart from the information shown above, it also indicates the number of times a particular lexical item occurs in the corpus and the number of times it has been manually validated, and lists all corpus elements (page ids) in which the particular item has been attested. As these identifiers also contain the year of publication for each element, it is easy to give an estimated time period in which a particular lexical entry was used.
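As an illustration of how an attestation period might be derived from the page identifiers, the following minimal Python sketch extracts a four-digit year from each identifier and returns the earliest and latest year found. The identifier layout used here (a year embedded as a separate four-digit group) and the sample identifiers are assumptions made for the example; the actual identifiers in the tabular file may be structured differently.

```python
import re

# Assumption: each page id embeds the publication year as a stand-alone
# four-digit group between 1500 and 1999. The real id layout may differ.
YEAR_RE = re.compile(r"(?<!\d)(1[5-9]\d\d)(?!\d)")


def attestation_period(page_ids):
    """Return (earliest_year, latest_year) for a list of page ids,
    or None if no id contains a recognisable year."""
    years = []
    for page_id in page_ids:
        match = YEAR_RE.search(page_id)
        if match:
            years.append(int(match.group(1)))
    if not years:
        return None
    return min(years), max(years)


# Hypothetical page ids, invented purely for illustration.
example_ids = ["NUK_10220-1811_p012", "NUK_10220-1811_p040", "KB_1789_p003"]
period = attestation_period(example_ids)
```

With the hypothetical identifiers above, `period` holds the earliest and latest publication years in which the item is attested, which gives only a rough estimate of the period of use.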

The second format consists of structured lexical entries encoded in TEI P5, using the dictionary module. Export to this format is supported directly by CoBaLT, and we also developed a script that converts the XML into HTML for browsing. The TEI version provides a stable storage format and will serve as a resource for ToTrTaLe, while the HTML version allows lexical items to be inspected in a lemma-oriented fashion, in order to discover remaining problems and mistakes.
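To give an idea of what lemma-oriented inspection of such a file could look like, the sketch below reads TEI dictionary entries with Python's standard library and groups word forms under their lemma. The element names `entry`, `form` and `orth` come from the TEI dictionary module, but the `type` values ("lemma", "wordform") and the overall entry layout are assumptions for this example and are not taken from the actual CoBaLT export; the sample content is placeholder text, not real lexicon data.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Placeholder sample. The entry layout and the type values are assumed,
# not copied from the real export.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <entry>
      <form type="lemma"><orth>lemma-a</orth></form>
      <form type="wordform"><orth>form-a1</orth></form>
      <form type="wordform"><orth>form-a2</orth></form>
    </entry>
    <entry>
      <form type="lemma"><orth>lemma-b</orth></form>
      <form type="wordform"><orth>form-b1</orth></form>
    </entry>
  </body></text>
</TEI>"""


def lemma_index(xml_text):
    """Map each lemma to the list of word forms found in its entry."""
    root = ET.fromstring(xml_text)
    index = {}
    for entry in root.iterfind(".//tei:entry", TEI_NS):
        lemma = entry.findtext(
            "tei:form[@type='lemma']/tei:orth", namespaces=TEI_NS
        )
        if lemma is None:
            continue  # skip entries without a lemma form
        forms = [
            orth.text
            for orth in entry.iterfind(
                "tei:form[@type='wordform']/tei:orth", TEI_NS
            )
        ]
        index.setdefault(lemma, []).extend(forms)
    return index


index = lemma_index(SAMPLE)
```

A dictionary keyed by lemma is chosen here because it mirrors the lemma-oriented view described above; a real conversion script would need to follow the structure of the exported entries.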



Jozef Stefan Institute


For further information on licensing, please contact the JSI IMPACT Group.