Corpus Based Lexicon Tool (CoBaLT)


A tool for corpus-based lexicon construction. Users can upload a text dataset (corpus) for use in creating an attestation-based lexicon.

This tool is used to manually correct automatically lemmatized corpus text. Verified lemmatized words, together with the context in which they appear, are stored in the Information Retrieval Lexicon. The tool handles plain text and various XML formats, including the IMPACT Page XML format and TEI. Its key requirements are that it can process large quantities of data quickly, that it is a web application that can be run from any computer in the local network, that frequent input actions can be performed with the keyboard, and that information is presented in a way that allows quick evaluation.
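To make the idea of an attestation-based lexicon concrete, here is a minimal Python sketch of how a verified lemma might be stored together with the context in which the word form occurs. It is an illustration only and does not reflect CoBaLT's actual data model or storage layer. The `Attestation` record, the `attest` and `context_window` helpers, the file name, and the sample sentence are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Attestation:
    """One verified occurrence of a word form (hypothetical record layout)."""
    document: str   # name of the source document
    position: int   # token index within the document
    word_form: str  # the token as it appears in the text
    lemma: str      # lemma confirmed by the annotator
    context: str    # surrounding tokens, kept as evidence


def context_window(tokens, index, width=3):
    """Return up to `width` tokens on each side of `index` as one string."""
    start = max(0, index - width)
    return " ".join(tokens[start:index + width + 1])


def attest(lexicon, document, tokens, index, lemma):
    """Record a manually verified lemma for the token at `index`."""
    entry = Attestation(document, index, tokens[index], lemma,
                        context_window(tokens, index))
    lexicon[lemma].append(entry)
    return entry


# The lexicon maps each lemma to the list of its attestations.
lexicon = defaultdict(list)
tokens = "the olde kynge rode to the toune".split()  # invented sample text
attest(lexicon, "sample.txt", tokens, 2, "king")
attest(lexicon, "sample.txt", tokens, 6, "town")
```

Keying the lexicon by lemma means every entry carries its own evidence: looking up `lexicon["king"]` returns each historical spelling along with the passage that attests it.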

Corpora screen

Main screen

New word analysis


  • IMPACT deliverable D-EE2.6 Lexicon Cookbook (December 2011)
  • Tom Kenter, Tomaž Erjavec, Maja Žorga Dulmin, Darja Fišer. 2012. Lexicon construction and corpus annotation of historical language with the CoBaLT editor. In Proceedings of the EACL Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, Avignon, France, April. Association for Computational Linguistics.


The tool is available to the research community under the Apache Software License (ASL).

OCR Post-correction and Enrichment

Further resources

Succeed training materials


CoBaLT (Corpus Based Lexicon Tool) is an application into which a corpus of texts can be loaded so that its tokens can be annotated (lemmatized, among other things).