Tokenizing

Tokenizer

A lexicon building project in CoBaLT will always start with uploading a corpus. The tool can be fed untokenized documents without difficulty; the only requirement is that they are saved as UTF-8 encoded plain text files in .txt format. During the upload, those untokenized documents will be tokenized by the tokenizer integrated into the tool.

However, it is also possible to upload documents that are already tokenized; these will be recognized as such automatically. To determine whether a file is tokenized, the tool looks at the first 10 lines to see whether they look ‘tokenized’, like this:

canonicalForm<TAB>wordForm<TAB>onset<TAB>offset<TAB>…

So, for example, the following could be the output of the tokenizer for the text “Hello world!”:

[image: helloWorld_tokenized.tab]
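As a sketch (assuming the tokenizer keeps attached punctuation in the word form, as in the examples further below; the offsets are illustrative), such a file could contain:

Hello<TAB>Hello<TAB>0<TAB>5
world<TAB>world!<TAB>6<TAB>12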

Some character sequences might not be interesting for lexical processing in the tool, even though it is nice to see them displayed for ease of reading. This goes, for example, for punctuation marks that appear separately in a text.

Tokens like these, which should be displayed but not treated as word forms in the tool, should have the string ‘isNotAWordformInDb’ in the fifth column of the tokenized file. Other words (as in the example above) should have an empty fifth column. So PLEASE NOTE that in that case a tab will end the line:

1            2            3    4    5
Zamensp      Zamensp.     744  751
over         over         753  757
de           de           758  760
Deenemarken  Deenemarken  801  812
.            .            821  822  isNotAWordformInDb
Bl.          Bl.          844  847
210          210          848  851

Position information for OCR’ed material

When working with OCR’ed material, there is a relation between the text appearing in the tool and the original image. The tool supports this relation (see the section ‘Viewing a word form in the original image’ below).

The ‘documents’ table in the MySQL database has a column ‘image_location’ in which a path to the image can be set. Also, the coordinates of a word occurrence in the original document image can be set in the sixth to ninth columns of the tokenized file. These columns should specify the x-coordinate, y-coordinate, height and width, respectively:

1          2          3     4     5  6    7     8    9
DE         DE         2268  2270     365  1700  105  1130
KAARTEN    KAARTEN    2271  2278     365  1700  105  1130
MOETEN     MOETEN     2279  2285     365  1700  105  1130
GEPLAATST  GEPLAATST  2286  2295     365  1700  105  1130
WORDEN     WORDEN.    2296  2302     365  1700  105  1130
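The image path itself can be set directly in the database. A minimal sketch in SQL, assuming a hypothetical document_name column as the key (check the actual column names of the ‘documents’ table in your own database):

UPDATE documents
SET image_location = '/data/images/myDocument.jpg'
WHERE document_name = 'myDocument.tab';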

Tokenizing XML

The default tokenizer can also handle XML files in certain formats. Currently the supported formats are IGT XML (the IMPACT ground truth format) and TEI (as provided by JSI).

A TEI file should look like the following:

<?xml version="1.0" encoding="UTF-8"?>
<TEI.2>
  <teiHeader type="letter"></teiHeader>
  <text>
    <body>
      <div type="letter" xmlns="http://www.tei-c.org/ns/1.0" xml:lang="nl" n="filename">
        <p>
          <w>Desen</w>
          <c> </c>
          <w>brijef</w>
          <c> </c>
          <w>sal</w>

          (…)

          <c> </c>
          <w>ul</w>
          <c> </c>
          <w>huijsvrou</w>
        </p>
      </div>
    </body>
  </text>
</TEI.2>

TEI files need a TEI header, within which the document type is stated. The actual text content is enclosed within a text-tag, which in turn contains a body-tag.

A body-tag has div-tags enclosed within it. Within a div-tag, each paragraph of the text must be enclosed within a p-tag. Single words are enclosed within w-tags, and the spaces in between are encoded by c-tags. The example above shows all of this.

If a word has some punctuation attached, a normalized form (without punctuation) is added to the w-tag as an attribute:

<w nform="pas">pas,</w>
<c> </c>
<w nform="an">[…]an</w>

A w-tag is not allowed to contain any other tag. So if any property of a word needs to be given, it must always be set as an attribute of the w-tag.

So absolutely NEVER write something like this:

<w><abbr>ver</abbr>staen</w>

But do instead write this:

<w original="&lt;abbr&gt;ver&lt;/abbr&gt;staen">verstaen</w>

The above example shows that the original word, with its inner tags, has to be escaped and put into an ‘original’ attribute of the w-tag.
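If you generate such TEI files with a script, this escaping is a one-liner in most languages. A minimal sketch in PHP (chosen only because the tool itself is configured in PHP; the variable names are illustrative):

$original = '<abbr>ver</abbr>staen';
$escaped  = htmlspecialchars($original, ENT_QUOTES); // escapes <, >, & and quotes
echo '<w original="' . $escaped . '">verstaen</w>';
// prints: <w original="&lt;abbr&gt;ver&lt;/abbr&gt;staen">verstaen</w>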

Choosing the right format: .txt, .tab or .xml (TEI)?

Before choosing to feed CoBaLT with a corpus in a given format, it is important to know the consequences of choosing one format over another.

If you want to upload a corpus into CoBaLT so as to be able to annotate it, you will probably also want to get your original corpus files back at the end of the project, enriched with those annotations. This can only be achieved if you choose to upload in .xml format (TEI). At the end of a project, the scripts described in the ‘Data export’ section actually take your original files (stored in the database) and add your annotations to them before exporting them to disk. CoBaLT will not be able to do so if you originally loaded something other than .xml.

If you mainly want to use CoBaLT to build a lexicon, and do not feel the need to enrich your original corpus files with annotations, then loading your corpus as .txt files (plain text) or as .tab (tokenized) files is good enough. As the project is being worked on in CoBaLT, the lexicon will gradually be built or enriched in the so-called lexicon database (which you created in the ‘Create databases’ section). All you will have to do at the end of the project is copy the lexicon database to a suitable location for your own needs, or export this database to .xml (check the ‘Data export’ section about that).
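Such a copy can be made with a standard MySQL dump, for instance (the user and database names are placeholders for your own setup):

mysqldump -u your_user -p your_lexicon_database > lexicon_backup.sql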

Of course, .tab has the advantage over the .txt format that you can add position information for OCR’ed material, which is impossible in .txt format. Unfortunately, adding position information is not yet supported in the .xml (TEI) format.

Customize the tokenizer for your own language

For a tokenizer, being able to deal with a specific language means knowing its common abbreviations, so as to be able to distinguish an ‘abbreviation dot’ from a ‘sentence end dot’. It also means knowing words commonly bordered by an apostrophe (like ’em in English), so as to be able to distinguish such apostrophes from the ones used as quotation marks. It is very important that the tokenizer actually knows these special cases, because otherwise every dot it encounters will be interpreted as a sentence end, and every apostrophe as the beginning or end of a quote.

So, you will need to build two files containing the abbreviations and apostrophe words for your specific language.

Let’s start with the filenames. Say you would like the tokenizer to be able to process English. Choose a short name for this language, say ‘eng’. This is the name you’ll have to use throughout the tool configuration to refer to this language. Now create two text files with this short name as an extension:

abbr.eng
apostrof.eng

Now you are ready to tell the tool which abbreviations and apostrophe words it should recognize. Declare the abbreviations in the ‘abbr.eng’ file, and the apostrophe words in the ‘apostrof.eng’ file. Open the files in a text editor, and make sure both files contain just one word per line:

[image: abbr.eng and apostrof.eng]
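For English, the contents could look something like this (the entries below are merely illustrative):

abbr.eng:

Mr.
Dr.
etc.
e.g.

apostrof.eng:

'em
'tis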

When you are finished, put both files into the tokenizer directory, so the tokenizer will be able to find them. If you are not sure where this directory is, look it up in the php/globals.php file, as the tokenizer location must have been set there (check the ‘CoBaLT configuration’ section about this file):

$sTokenizerDir = '/servername/some_directory/Tokenizer_directoryname';

As a last step, you will need to tell the tool you want it to tokenize English now. You can achieve that by setting the following in the php/globals.php file:

$sTokenizerLanguage = 'eng';