Slovene Lexicon - IMPACT Centre of Competence

ABSTRACT

Apart from about 40 pages from a sixteenth-century and a seventeenth-century book, the dataset for historical Slovene contains material published from the second half of the eighteenth century to the end of the nineteenth century. The material consists of books and one daily newspaper.

The current lexicon consists of the initial 3,000 lexical entries developed in LeXtractor and the lexicon that can be automatically extracted from the manually validated tokens of the reference corpus. At the time of writing, the lexicon extracted from the manually validated corpus tokens had the following size: 16,245 lexical entries, 15,715 word forms, 14,249 normalized forms, 11,396 modernized forms and 6,789 lemmata.
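As a rough illustration of how counts like these could be derived, the sketch below assumes that each validated token carries four fields (word form, normalized form, modernized form, lemma) and that a lexical entry is a distinct combination of all four. The record layout, the field names and the sample values are hypothetical and are not taken from the lexicon itself.

```python
from collections import namedtuple

# Hypothetical token record; the real corpus annotation may be structured differently.
Token = namedtuple("Token", ["word_form", "normalized", "modernized", "lemma"])

def lexicon_sizes(tokens):
    """Count distinct values at each annotation level of validated tokens."""
    tokens = list(tokens)
    return {
        # Assumption: one lexical entry per distinct combination of all four fields.
        "lexical_entries": len(set(tokens)),
        "word_forms": len({t.word_form for t in tokens}),
        "normalized": len({t.normalized for t in tokens}),
        "modernized": len({t.modernized for t in tokens}),
        "lemmata": len({t.lemma for t in tokens}),
    }

# Invented sample tokens, for illustration only.
SAMPLE_TOKENS = [
    Token("Krajnſki", "krajnski", "kranjski", "kranjski"),
    Token("krajnſki", "krajnski", "kranjski", "kranjski"),
    Token("krajnſkiga", "krajnskiga", "kranjskega", "kranjski"),
]

sizes = lexicon_sizes(SAMPLE_TOKENS)
```

In this toy sample the counts shrink from word forms to lemmata, which mirrors the pattern in the figures above: several historical spellings collapse into fewer normalized and modernized forms, and those in turn into fewer lemmata.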

The lexicon is available in two formats. The first is a simple tabular file; apart from the information shown above, it also indicates the number of times a particular lexical item occurs in the corpus and the number of times it has been manually validated, and lists all corpus elements (page ids) in which the particular item has been attested. As these identifiers also contain the year of publication for each element, it is easy to give an estimated time period in which a particular lexical entry was used.
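A minimal sketch of such a period estimate follows. It assumes a tab-separated file in which the last column holds space-separated page ids and each id embeds a four-digit publication year; the column position, the separator and the id pattern (`BOOK-1811.p012`) are assumptions made for illustration and should be checked against the actual file.

```python
import csv
import io
import re

# Assumption: the year is a standalone four-digit number between 1500 and 1999.
YEAR_RE = re.compile(r"(?<!\d)(1[5-9]\d{2})(?!\d)")

def attested_period(page_ids):
    """Return (earliest, latest) year found in the page ids, or None."""
    years = []
    for page_id in page_ids:
        match = YEAR_RE.search(page_id)
        if match:
            years.append(int(match.group(1)))
    if not years:
        return None
    return min(years), max(years)

def read_lexicon(handle, id_column=-1, id_sep=" "):
    """Yield (row, period) pairs from a tab-separated lexicon file."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # skip empty lines
        yield row, attested_period(row[id_column].split(id_sep))

# Invented one-line sample in the assumed layout.
sample = io.StringIO("krajnſki\tkranjski\t3\t2\tBOOK-1811.p012 BOOK-1848.p003\n")
rows = list(read_lexicon(sample))
```

The regular expression takes the first year-like number in each id, so ids that contain other four-digit numbers in that range would need a stricter pattern.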

The second format consists of structured lexical entries encoded in TEI P5, using the dictionary module. Export to this format is directly supported by CoBaLT, and we also developed a script to convert the XML into HTML for browsing. The TEI provides a stable storage format and will serve as a resource for ToTrTaLe, while the HTML makes it possible to inspect lexical items in a lemma-oriented fashion and so discover remaining problems and mistakes.
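To give an idea of lemma-oriented access to such entries, the sketch below groups word forms by lemma using Python's standard XML parser. The `entry`, `form` and `orth` elements and the TEI namespace are part of TEI P5, but the entry layout and the `type` values (`lemma`, `historical`) are assumptions for illustration; the actual CoBaLT export may be structured differently.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented fragment in an assumed layout, not copied from the lexicon.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <entry>
      <form type="lemma"><orth>kranjski</orth></form>
      <form type="historical"><orth>krajnſki</orth></form>
      <form type="historical"><orth>krajnſkiga</orth></form>
    </entry>
  </body></text>
</TEI>"""

def forms_by_lemma(tei_xml):
    """Map each lemma to the list of historical word forms in its entry."""
    root = ET.fromstring(tei_xml)
    index = defaultdict(list)
    for entry in root.iterfind(".//tei:entry", TEI_NS):
        lemma = entry.findtext("tei:form[@type='lemma']/tei:orth",
                               namespaces=TEI_NS)
        if lemma is None:
            continue  # skip entries without a lemma form
        for orth in entry.iterfind("tei:form[@type='historical']/tei:orth",
                                   TEI_NS):
            index[lemma].append(orth.text)
    return dict(index)

lemma_index = forms_by_lemma(SAMPLE)
```

An index of this kind could feed a simple HTML listing with one section per lemma, similar in spirit to the browsing view described above.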

PUBLICATIONS

PRODUCED BY

Jozef Stefan Institute

LICENSING

For further information on licensing, please contact the JSI IMPACT Group.

DOWNLOAD

Slovene-Lexicon.tar.bz2

IMPACT deliverable DEE3.13 Slovene Lexicon Documentation (February 2012)
Tomaž Erjavec. 2012. The goo300k corpus of historical Slovene. Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), Istanbul.
Tom Kenter, Tomaž Erjavec, Maja Žorga Dulmin, Darja Fišer. 2012. Lexicon construction and corpus annotation of historical language with the CoBaLT editor. In Proceedings of the EACL Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, Avignon, France, April. Association for Computational Linguistics.
Ines Jerele, Tomaž Erjavec, Daša Pokorn, Alenka Kavcic-Colic. 2012. Optical Character Recognition of Historical Texts: End-User Focused Research for Slovenian Books and Newspapers from the 18th and 19th Century. Review of the National Center for Digitization, 21/2012, Faculty of Mathematics, Belgrade.
Erjavec, T. “Slovenska prevodna književnost 1848-1918: digitalna književnost in korpus AHLib (Slovene translated literature 1848-1918: digital library and corpus AHLib).” Proceedings of the conference Obdobja, 2011, Interdisciplinarity in Slovene Studies, 17-19 Nov 2011, Ljubljana, Slovenia.
Erjavec, T., I. Jerele and M. Kodric. “Izdelava korpusa starejših slovenskih besedil v okviru projekta IMPACT (Compiling a corpus of historical Slovene texts in the IMPACT project).” Proceedings of the conference Obdobja, 2011, Interdisciplinarity in Slovene Studies, 17-19 Nov 2011, Ljubljana, Slovenia.
Erjavec, T., C. Ringlstetter, A. Gotscharek and M. Zorga. “A lexicon for processing archaic language: the case of XIXth century Slovene.” WoLeR 2011 International Workshop on Lexical Resources at ESSLLI, 1-5 August 2011, Ljubljana, Slovenia.
Erjavec, T. “Towards a Lexicon of XIXth Century Slovene.” LaTeCH 2011, 5th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, 24 June 2011, Portland, Oregon. Proceedings pp. 33-38, ISBN: 978-1-61839-232-9.
Erjavec, T. “Annotating historical Slovene texts: first experiments.” Conference on New Methods in Historical Corpora, 29-30 April 2011, Manchester UK.
Tomaž Erjavec, Christoph Ringlstetter, Maja Žorga, Annette Gotscharek. Towards a Lexicon of XIXth Century Slovene. Proceedings of the IS-JT ’10 Seventh Language Technologies Conference (14-15 October 2010, Ljubljana, Slovenia).