The tokenizer pre-processes the documents that form the corpus from which the lexicon is built.
Tokenization is the process of breaking a stream of text into words, or tokens. This tokenizer is based on ILKTOK, part of the ‘Tadpole’ language processing suite (ilk.uvt.nl/software/). The code was rewritten to produce the output required for the database used for the IMPACT Lexicon and to introduce a more modular design.
The output is a file in which each token is printed on a separate line together with its additional fields: the onset and offset of the token in the input file, and the fragment of the document that contains the token. Concatenated in order, the fragments reproduce the complete original document. The fields are separated by a TAB.
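The described output layout can be illustrated with a minimal sketch. This is not the actual ILKTOK code or its tokenization rules; it simply assumes that tokens are maximal non-whitespace runs and that each fragment extends from one token's onset to the next, so that the fragments concatenated in order reproduce the input document exactly, as the format requires.

```python
import re

def tokenize(text):
    """Yield (token, onset, offset, fragment) rows.

    Illustrative sketch only, not the ILKTOK rules: tokens are
    maximal non-whitespace runs; each fragment runs from a token's
    onset to the next token's onset (the first fragment absorbs any
    leading text, the last runs to the end of the document), so the
    fragments concatenated in order reproduce the input exactly.
    """
    matches = list(re.finditer(r"\S+", text))
    rows = []
    for i, m in enumerate(matches):
        onset, offset = m.start(), m.end()
        frag_start = 0 if i == 0 else onset
        frag_end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        rows.append((m.group(0), onset, offset, text[frag_start:frag_end]))
    return rows

# One TAB-separated line per token, as in the described output file.
for token, onset, offset, fragment in tokenize("A small test.\n"):
    print("\t".join([token, str(onset), str(offset), repr(fragment)]))
```

A real tokenizer would additionally split punctuation from adjacent words (here "test." stays a single token); the sketch only demonstrates the per-line field layout and the fragment invariant.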
- IMPACT deliverable D-EE2.6 Lexicon Cookbook (December 2011)