The first screen is a log-in screen. The different lexicon projects listed here should have been set in the ‘CoBaLT configuration’ section. The user logs in by simply typing in his/her name and choosing a project to work on.
In the next screen the user can choose a corpus to work on. Corpora can be added or deleted here, and files can be added to or deleted from corpora.
Corpus mode vs document mode
The tool can work in two modes: corpus mode and document mode.
By clicking on a corpus name the tool will load in corpus mode (i.e. the user will work on a list of word forms for an entire corpus). By clicking on a document name instead, the tool will load in document mode, which means that you will be exclusively working with the tokens of that particular document.
Documents can be uploaded into the tool one by one, or a lot of them together in a zip file. The tool checks filename extensions. Only .txt files, .tab files and .xml files are processed (and of course .zip for a collection of such files).
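The extension check described above can be sketched as follows. This is a minimal illustration, not CoBaLT's own code: the function names are hypothetical, and treating extensions case-insensitively is an assumption.

```python
import zipfile
from pathlib import PurePosixPath

# Extensions named in the text. How CoBaLT itself implements the check
# is not shown in this manual; this is only an illustration.
DOCUMENT_EXTENSIONS = {".txt", ".tab", ".xml"}


def is_document(filename):
    """True if the name ends in one of the document extensions.

    Assumption: the comparison ignores upper/lower case.
    """
    return PurePosixPath(filename).suffix.lower() in DOCUMENT_EXTENSIONS


def uploadable_members(zip_source):
    """List the members of a zip archive that would pass the check.

    zip_source may be a path or an open binary file object.
    """
    with zipfile.ZipFile(zip_source) as archive:
        return [name for name in archive.namelist() if is_document(name)]
```

Running `uploadable_members` on a zip file before uploading shows which files the tool can be expected to process and which it will skip.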
First click on the directory icon to make a new corpus directory:
Then tell the tool where the corpus files are to be found (remember collections of files must be zipped in advance). Please note that every single file should be UTF-8 encoded. No other encoding is supported.
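Since only UTF-8 is supported, it can be useful to check files before uploading them. The following is a minimal sketch that uses only the Python standard library; the function names are hypothetical and the check is not part of CoBaLT.

```python
def is_utf8(data):
    """True if the given bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def file_is_utf8(path):
    """Read a file in binary mode and test its encoding."""
    with open(path, "rb") as handle:
        return is_utf8(handle.read())
```

Note that plain ASCII files pass this check as well, because ASCII is a subset of UTF-8.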
Click on the ‘Add a file…’ icon, and you’ll be able to browse to the right location to upload your file from:
Finally, click the ‘Upload file’ button and wait until the file is processed. If you have chosen a large zip file, processing can take quite a while (a few seconds after the upload has started, a progress indicator will be shown in the page tab at the top of the browser window).
Any untokenized document will be tokenized by the tokenizer integrated in the tool. However, it is also possible to upload tokenized documents (which will be recognized as such automatically). This is particularly handy if you need your tokenization to be different from what the default tokenizer provides. Please refer to the ‘Tokenizing’ section to see what a tokenized file should look like.
When the file upload has completed, the screen should look like this:
Now you’re ready to go and work with your files. Click on the corpus name if you wish to see and/or process the whole corpus at once, or click on a single file name if you’d like to work only with that particular document.
As soon as you have loaded files in the Corpus screen, you will be able to access the Main screen. This is the place where you will be able to work on your data:
In the screenshot above we see CoBaLT in action. It is running in corpus mode, which means that the user selected a collection of documents to work on (‘corpus mode’), rather than just one single document (‘document mode’).
The interface is divided into three main parts.
- Top bar: For sorting and filtering word forms and lemmas, and adjusting the views on the data.
- Middle part: Shows the type-frequency list plus analyses in the corpus/document and database.
- Lower part: Shows the individual occurrences of word forms (‘tokens’) in context plus their analyses.
Adjusting the size of the window parts
The ratio between the middle part and the lower part can be altered by dragging the resize icon (circled in red in the partial screenshot below).
CoBaLT offers two different ways of navigating through the data. The first is a scroller bar, the second a ‘Go to wordform…’ box:
When there is a large number of pages available, a scroller bar will appear at the top of the screen. Hovering the mouse cursor at the right side of the middle will scroll the list of pages to the left and vice versa. To stop the list from scrolling, move the mouse cursor to the middle of the bar. Click the highlighted number to go to that page.
‘Go to wordform…’ box
When a data set consists of hundreds of pages, using the scroller bar to access the right page means scrolling for quite a while before the right page gets in sight.
If you do know exactly which word form you would like to get to, just type it into the ‘Go to wordform…’ box and press ‘Enter’. CoBaLT will then look up the right page and load it right away. This also works with only the prefix of a word form (just enter a prefix followed by a %-sign to indicate that some unknown letters should follow). The following example catches both ‘computer’ and ‘computers’:
As a project is getting on, one might want to know how much work still has to be done. The tool offers two possibilities:
– click the small heart icon to get a list of all word forms that are not yet lemmatized at all
– click the small diamond icon to get a list of all word forms that were only partially lemmatized.
In the following section it is described how analyses can be edited in the lower as well as in the middle part, what analyses may consist of, and how to customize the ways in which the data are presented.
An overview of key functions and mouse actions can be found below. Note that with most buttons and also with the analyses in the right-hand column in the middle part, an explanatory box appears when the mouse cursor is hovered over it.
The middle part: Word form types and their analyses in corpus/document and database
|Type-frequency list||Analyses in loaded corpus or document||Analyses in the entire database|
In the middle part of the screen a type-frequency list is displayed with some additional options and information. A selected row is editable, allowing the user to add/delete analyses. A row is selected by clicking on it; it will be highlighted yellow (unless it is hidden, in which case it is displayed pink; see below). In this part only one row can be selected at a time; by using arrows the previous or the next row becomes selected.
The middle column displays the analyses a word form has in the loaded corpus or document (depending on the mode).
The rightmost column in the middle part shows all the analyses linked to a word form in the entire database; those in bold are validated (see below).
When a word form is selected (like Die in the example screenshot), its occurrences are loaded in the lower part of the screen together with some context (see next section).
As actions performed in the middle part apply to the tokens listed in the lower part, the functioning of the latter will be explained first.
The lower part: occurrences in context
The lower part of the screen shows the occurrences (tokens) in context of the word form selected in the middle part. The right-hand column lists the analyses of the word forms in each context.
Each occurrence is presented with some context (see below on how to adjust the amount of context in view, or the number of context rows per page).
Token plus context can be selected and deselected by clicking anywhere in the corresponding row. You can select a group of adjacent rows by holding down the mouse button while moving the mouse from the first row of the group down to the last row you wish to select. Non-adjacent rows can be selected by Ctrl-clicking them one at a time (i.e. clicking while holding down the Ctrl key). Select all rows at once by double clicking one single row. In order to deselect a single row from a group selection, Ctrl-click that row.
In this part of the tool, any action that is performed is applied to all rows that happen to be selected. Note that not every row that is selected is necessarily in view; see below on the number of context rows per page.
Editing analyses on the token level
Analyses can be added, removed, and validated in various ways on interdependent levels. The most specific level is that of a token in context.
By clicking on a yellow-highlighted word in its context in the lower part of the screen, a drop-down menu appears which lists all corresponding analyses in the database, with those analyses already assigned to the token in question highlighted. In the following screenshot, the word group drÿ en vÿftig appears to have the analysis drieënvijftig, TELW assigned.
By switching analyses ‘on’ or ‘off’, they will be (de)assigned to the token(s) in the selected row(s). The option New… at the bottom of the menu provides the possibility to type in a new analysis in a text box, with existing analyses being suggested as you type.
The rightmost column shows the analyses for the occurrence in this context. Clicking on one of them will result in this analysis being deleted from the row it is in, as well as from any other selected rows it featured in.
One analysis for a group of words
In the example screenshot the words drÿ, en, and vÿftig are marked together to make up the analysis drieënvijftig, TELW. You can add words to a group, or delete them from it by clicking on them while holding down the F2 or F9 key. If a word you just added to the group already has analyses of its own in this particular context, these will show in the right-hand column.
More lemmata for one word form
Conversely, a word form may occasionally consist of more than one lemma. Multiple analyses can be assigned by using an ampersand (&); e.g., l’équipe could be analysed as la, DET & équipe, N.
To indicate whether attestations and analyses are verified by a user, they can be validated.
Word forms in their context can be validated (regardless of whether they have an analysis or not) by checking a validation box at the right of the context row in the lower part of the screen. This can be interpreted as ‘the user saw the word form in this context and approves of the listed analyses’. As with any other action in this part of the screen, (de)validation is applied to any sentences that happen to be selected.
By assigning an analysis to a token attestation in the lower part of the screen (either by choosing an option from the drop-down list as described above or by an action in the middle part as will be described below), the token plus its analyses become validated automatically.
When the validation checkbox is greyed out a bit, this means that this occurrence of the word form in this context was validated by a different user.
When a token is attested in its context or is validated, the analyses associated with the word form in this context automatically become validated. Validated analyses are displayed in bold in the right column of the middle part in the tool (in the example screenshot nearly all analyses are validated).
Analyses can also be (un)validated ‘by hand’ by Shift-clicking them (clicking while holding down the Shift key).
NOTE that it is not necessary for a validated type analysis to have a token attestation. E.g., the analysis duizend, TELW for the word form duijsent in the example screenshot is currently unvalidated. It only occurs once in the corpus, apparently as part of a group duizendachthonderdrieëntwintig. A user might decide to validate the analysis duizend, TELW nevertheless, even though there is no attestation to support it.
Editing analyses in the middle part
The rightmost column in the middle part of the screen shows all the analyses a word form (type) has in the entire database. As mentioned above, those in bold are validated; they can be (un)validated by Shift-clicking on them.
By clicking on an analysis it is applied to the rows that are selected in the lower part of the tool, which will in turn become validated (shown by the checkbox at the right being checked). When no row is selected, the analysis will be assigned to all tokens in the lower part, but these will NOT become validated.
The idea behind this is that it makes it fast to assign analyses to all occurrences of a word form without further disambiguation, e.g. when a word has more than one analysis. If the user has selected one or more rows, (s)he is presumably sure about the analysis, so the selected rows become validated.
An analysis for a certain word form can be deleted altogether by Ctrl-clicking it in the right-hand column. As a consequence, the analysis is also removed from every token attestation of that word form in which it featured.
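The rule for clicking an analysis in the middle part can be summarised in a few lines of code. This is only an illustrative model of the behaviour described above; the class and function names are hypothetical and do not come from the CoBaLT source.

```python
from dataclasses import dataclass, field


@dataclass
class ContextRow:
    """One token in context in the lower part (hypothetical model)."""
    analyses: set = field(default_factory=set)
    selected: bool = False
    validated: bool = False


def assign_from_middle_part(rows, analysis):
    """Apply an analysis clicked in the middle part to the context rows.

    With a selection: only the selected rows get the analysis, and
    they become validated. Without a selection: every row gets the
    analysis, but none of them becomes validated.
    """
    selected = [row for row in rows if row.selected]
    targets = selected if selected else rows
    for row in targets:
        row.analyses.add(analysis)
        if selected:
            row.validated = True
```

For example, with two rows of which only the first is selected, only the first row receives the analysis and is marked as validated.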
As with the analyses of the entire database, the corpus/document analyses filled in in the text box in the middle part will be applied to the selected row(s) in the lower part of the tool which then will also become validated. Again, when no row is selected in the lower part, the analyses assigned in the middle part will be applied to all token attestations without them being validated.
Which analyses appear in which column?
The lower part of the screen shows occurrences in context from the corpus/document. By assigning analyses to these occurrences, the word forms become attested.
As said earlier, the middle column in the middle part of the screen shows the analyses a word form has in the current corpus/document. So these will match with the ones in the lower part of the screen.
The analyses in the rightmost column however do not necessarily show anywhere, which may be a source of confusion. This can be the case if these analyses are for word forms in a document in the same database but in a different corpus, or because they are not associated with any word form in context at all (i.e. the analysis is associated with the word form, or it would not show in the first place, but there is no attestation in context anywhere yet). The latter may e.g. be the case when a database comes preloaded with information from an external lexicon/dictionary.
On word form analyses
What do analyses look like?
An analysis as it is used in the tool (and this manual) refers to a tuple that can be described as:
modern lemma (, <modern_equivalent>) (, [set of patterns]) , part of speech (, language) (, gloss)
The parts between round brackets are optional. Only the lemma and the part of speech are obligatory, so a typical, simple lemma would be e.g. the, DET.
Modern equivalent and patterns
Optionally, a modern equivalent word form and some patterns can be specified in an analysis.
E.g. for the German word form theyle an analysis could be teil, <teile>, [(t_th, 0), (ei_ey, 2)], N. The part between angled brackets (<>) is the modern equivalent word form (which is possibly inflected) and the part in between the square brackets represents a series of pattern substitutions to get from the modern equivalent to the historical one. In the example, the substitution th for t should be applied at position 0 (the first character).
Neither modern equivalent nor patterns are obligatory. Nor are they required to be specified both. In other words:
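To make the pattern notation concrete, here is a small sketch that rewrites the modern equivalent into the historical word form. It assumes that patterns are applied from left to right and that each position counts characters in the partially rewritten string (which, for left-to-right application, is the same as the position in the historical form). This reading is consistent with the theyle example, but the manual does not spell it out. The function name and the data layout are hypothetical.

```python
def apply_patterns(modern_equivalent, patterns):
    """Rewrite a modern word form using (old, new, position) patterns.

    Assumption: patterns are applied in the order given, and each
    position refers to the string as rewritten so far.
    """
    word = modern_equivalent
    for old, new, position in patterns:
        if word[position:position + len(old)] != old:
            raise ValueError(
                f"expected {old!r} at position {position} in {word!r}"
            )
        word = word[:position] + new + word[position + len(old):]
    return word


# (t_th, 0) and (ei_ey, 2) from the example analysis for theyle
historical = apply_patterns("teile", [("t", "th", 0), ("ei", "ey", 2)])
print(historical)  # theyle
```

The first substitution turns teile into theile; the second then finds ei at position 2 and yields theyle.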
teil, [(t_th, 0), (ei_ey, 2)], N and teil, <teile>, N
are valid lemmas as well.
Next to the part of speech, a language may be specified. This could be used e.g. to keep Latin phrases apart from other text.
To be able to keep similar lemmas apart, a gloss may be added. E.g. paper, N, material versus paper, N, essay or paper, N, newspaper.
NOTE that if language names are to be used the table ‘languages’ in the database should be filled with the various options. It is only when the first word after the part of speech in the analyses matches one of these options that it is treated as a language. If it does not match it is treated as (part of) the gloss.
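The decision between language and gloss can be sketched as follows. The sketch assumes that the analysis has already been split on commas and that the match against the ‘languages’ table is an exact string comparison; both are assumptions, and the function name is hypothetical.

```python
def split_language_and_gloss(fields_after_pos, languages):
    """Decide which fields after the part of speech are language and gloss.

    fields_after_pos: the comma-separated fields following the part of speech.
    languages: the options stored in the 'languages' table.
    Returns a (language, gloss) pair; either may be None.
    """
    fields = [f.strip() for f in fields_after_pos if f.strip()]
    if fields and fields[0] in languages:
        # The first field is a known language; the rest is the gloss.
        return fields[0], ", ".join(fields[1:]) or None
    # No match: everything is treated as (part of) the gloss.
    return None, ", ".join(fields) or None
```

With `languages = {"Latin"}`, the fields `["Latin", "legal phrase"]` give the language Latin and the gloss legal phrase, whereas `["material"]` gives no language and the gloss material.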
One word form, more lemmas
Sometimes a word form might better be analysed as consisting of two or more lemmas rather than one. In the tool this can be done by separating analyses by ampersands (&’s). E.g. l’Afrique could be analysed as le, DET & Africa, NELOC.
NOTE that in these cases the analysis cannot contain modern equivalents or patterns.
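As an illustration, a multi-lemma analysis can be taken apart by splitting on the ampersand first and on commas second. Because modern equivalents and patterns are not allowed here, the second comma-separated field of each part is the part of speech. The function name is hypothetical, and the sketch assumes that lemmas themselves contain neither ampersands nor commas.

```python
def split_multi_lemma_analysis(text):
    """Split 'lemma, POS & lemma, POS' into (lemma, POS) pairs.

    Assumption: lemmas contain no '&' or ',' characters.
    """
    pairs = []
    for part in text.split("&"):
        fields = [field.strip() for field in part.split(",")]
        if len(fields) < 2 or not fields[0] or not fields[1]:
            raise ValueError(f"not a 'lemma, POS' analysis: {part!r}")
        pairs.append((fields[0], fields[1]))
    return pairs


print(split_multi_lemma_analysis("le, DET & Africa, NELOC"))
# [('le', 'DET'), ('Africa', 'NELOC')]
```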
One analysis for several words
Sometimes, what can be thought of as one word appears as two or more. Consider e.g. separable verbs, which are very common in Dutch. Meedoen means to participate, and “I participate” would translate to “Ik doe mee”. In this phrase, the word forms doe and mee together make up the analysis meedoen, V.
The word forms forming one lemma do not have to be next to each other. In “Nina en Ella doen morgen mee” (“Nina and Ella will participate tomorrow”), doen and mee can be grouped into one analysis as well by clicking on them while holding down the F2 or F9 key.
Customizing the views on the data
The type-frequency list in the middle part of the tool can be sorted by using the arrows at the left side of the top bar. The list can be sorted alphabetically by word form, or by frequency. The alphabetical ordering can also be done from right to left (so e.g. blueish, reddish and greyish will appear near each other).
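Right-to-left alphabetical ordering amounts to sorting on the reversed word form. The sketch below shows the idea in Python; it is not taken from CoBaLT, and the tool's own collation rules may differ from Python's default string comparison.

```python
def sort_right_to_left(word_forms):
    """Sort word forms alphabetically by their reversed spelling."""
    return sorted(word_forms, key=lambda word: word[::-1])


words = ["blueish", "table", "reddish", "apple", "greyish"]
print(sort_right_to_left(words))
# ['table', 'apple', 'reddish', 'blueish', 'greyish']
```

Word forms with the same ending, such as the three forms in -ish, end up next to each other.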
Filter on word forms
There are two filter boxes in the top bar. The left filter applies to the word forms in the type-frequency list. You can filter on word form beginnings, endings or, in fact, anything in between. The filter is directly passed to the MySQL LIKE operator.
For those who are not MySQL gurus: the most frequently used wildcard is %, which means any sequence of characters. So %a% means: any word form consisting of an arbitrary sequence of characters, then an a, and then possibly some more characters. So ball would match, as would dance or pizza or indeed any word form with an a in it (including the word a itself). In the screenshot d% means all the word forms starting with a d (so dance would match again, but e.g. adorable wouldn’t).
For further documentation please refer to the MySQL documentation.
The filter is case-insensitive by default, but unchecking the box next to it will make it case-sensitive.
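To experiment with filter patterns outside the tool, the two standard LIKE wildcards can be emulated with a regular expression: % stands for any sequence of characters (including none) and _ for exactly one character. The sketch below is an approximation only. It does not handle escape characters, and real MySQL matching also depends on the collation of the column. The function name is hypothetical.

```python
import re


def like(word_form, pattern, case_sensitive=False):
    """Approximate the SQL LIKE operator for a single word form."""
    parts = []
    for char in pattern:
        if char == "%":
            parts.append(".*")   # any sequence of characters
        elif char == "_":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(char))
    flags = re.DOTALL if case_sensitive else re.DOTALL | re.IGNORECASE
    return re.fullmatch("".join(parts), word_form, flags) is not None


print(like("ball", "%a%"))      # True
print(like("dance", "d%"))      # True
print(like("adorable", "d%"))   # False
```

Passing `case_sensitive=True` mimics unchecking the box next to the filter, so that d% no longer matches Die.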
To de-activate a filter, empty the filter box and apply it (by hitting ‘Enter’).
Filter by lemma
To the right of the box for filtering word forms is a box for filtering by lemma. In this box you can type in a lemma and its part of speech, separated by a comma (e.g., lemma, NOU) and only word forms that have this lemma assigned to them will be shown.
Please note that patterns are not supported in this box, only complete analyses.
Edit a lemma or delete it from the database
When a lemma filter is applied and matches a lemma, an additional icon appears next to the lemma filter box. By clicking on this icon a new sub window opens in which a lemma can be edited or deleted.
Please note that editing or deleting a lemma will apply to that particular lemma in the entire database (not just the corpus selected).
Hiding/showing word forms
There can be various reasons for hiding word forms. It could be, e.g., that one feels that words in certain closed categories (let’s say determiners like the and a) have been dealt with sufficiently and there is no need to analyse them time and time again for every new document or corpus. Or it could be convenient for a particular task to temporarily hide all word forms that have no characters in them (so e.g. cipher words will not show in the list).
At the left side of each row in the middle part of the screen there are two buttons labeled c and a in corpus mode (for “don’t show in the corpus” and “don’t show at all”, respectively), or d (“don’t show for this document”) and a in document mode.
The d button (only visible in document mode) is for hiding the word form of that row for the current document. The c button (only visible in corpus mode) is for hiding the word form in that row for the current corpus.
The a button is for hiding the word form for the entire database regardless of corpus or document.
By switching on one of these buttons the row will be shown as hidden and displayed in pink. When the type-frequency list is reloaded, e.g., when a new filter is applied, or the user logs in again, the word form in question will not show again, unless the relevant show/hide button in the top bar is switched on.
In the top bar, just left of the filtering box, there are two buttons for showing, or hiding again, hidden word forms. They too are labelled c and a in corpus mode, or d and a in document mode. By default, these buttons are switched off (i.e., hidden word forms are not shown).
Word forms may be marked, by using the buttons in the middle part or by means of a script, to be e.g. “hidden for this corpus”. If the c button at the top is inactive, the word form will not be shown. The word forms are ‘unhidden’ if the c button is activated.
E.g., the row for dra in the example screenshots above is hidden for the current corpus, but is shown nevertheless because also the c button is switched on in the top bar, which means, “do show all word forms that were hidden for this corpus”.
Number of word forms per page
The number of word forms displayed per page in the middle part is set to 100 by default. In the top bar, to the right of the filtering boxes, this can be changed to 10, 20, 50, 100 or ‘all’. The latter means that all word forms that match the filter are displayed on a single page. In a large corpus this set can be very large, resulting in a long loading time.
Your choice will determine the number of pages the whole corpus/document is divided into. The horizontal bar with the page numbers will be shown only if there is more than one page to be displayed. By clicking on a number in this bar you jump to the corresponding page and the number will be highlighted. If there are a lot of word forms, the bar becomes scrollable.
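The relation between the chosen page size and the number of pages is a simple ceiling division. The following sketch illustrates it; the function names are hypothetical and the code is not taken from the tool.

```python
import math


def page_count(number_of_word_forms, per_page):
    """Number of pages for a given page size (an int or the string 'all')."""
    if per_page == "all":
        return 1
    return math.ceil(number_of_word_forms / per_page)


def page_bar_is_shown(number_of_word_forms, per_page):
    """The page bar appears only when there is more than one page."""
    return page_count(number_of_word_forms, per_page) > 1


print(page_count(250, 100))  # 3
```

So 250 matching word forms at the default of 100 per page give three pages, while choosing ‘all’ always gives a single page and no page bar.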
Number of context rows per page
When working with large corpora some word forms might occur very often. By clicking on them in the middle part of the screen, all the contexts they occur in will be loaded into the lower part, which may slow down things considerably. Because of this, the number of context rows shown per word form per page is 15 by default. Do note however that the speed advantage is most striking when the sentences are not sorted, which is the default (‘sorted by document’). If they are sorted, the tool has to actually collect all the sentences in the background, sort them, and then show a subset of them, so this is somewhat slower.
The number of context rows to be loaded in one page can be adjusted in the top bar, from 10 up to ‘all’. If there are more rows than fit on one page, a horizontal bar appears with clickable numbers to go to other pages.
Beware that with more rows per page than visible on the screen, rows may happen to be selected, and thus affected by actions, without actually being in view.
Amount of context
The amount of context surrounding the word form occurrences in the lower part of the tool, by default set to ‘normal’, can be increased by a drop-down menu in the top bar.
The user can also see more context for a particular token by clicking the » (guillemet) at the right of the context row in question:
A pop-up window will appear in which more context is shown. In this window the user can see even more context.
Sorting the tokens in context
By default, the context rows in the lower part of the tool are displayed in the order in which they appear in the documents (‘sorted by document’).
- Sort rows by context
It can be convenient however to sort the rows by the immediate context of the occurring word forms. By clicking on the small arrow buttons at the left side of the screen, the rows will be sorted according to the words to the left of the token (either ascending or descending).
|benevens virginie Desmandryl jonge dogter alle||drÿ||[…]|
|hoop tisten wel hadde konnen onderscheÿden van de||drÿ||[…]|
|lagen en desen hoop wel onderscheÿden van de||drÿ||[…]|
|maand meert in het iaar een duizend acht honderd||drÿ||[…]|
|Pro justitia Ten jaere agtien honderd||drÿ||[…]|
|van tisten en was gemaekt, maer dat ‘er nog||drÿ||[…]|
|verklaerd genaemd te zÿn pieter Baelde, oud||drÿ||[…]|
|distinctelijk kwam te remarquéren de voetstappen van||drÿ||[…]|
By using the arrow buttons at the right side of the screen the rows can be sorted by the right-hand side of the context. For the example screenshot this would be:
|[…]||drÿ||ander hoopen, om reden dat zÿ in de maenden|
|[…]||drÿ||andere hoopen waeren, vervold met zwÿn aerdappelen,|
|[…]||drÿ||andere inhoudende verkens-aerdappelen, rapen, en karoten;|
|[…]||drÿ||diversche persoonen, wanof twee aengedaen met schoon|
|[…]||drÿ||en twintig Wÿ ondergeteekende joannes|
|[…]||drÿ||en twÿntig den tiensten februarius, zÿn voor ons|
|[…]||drÿ||en vÿftig jaeren, getrouwd, landbouwer woonende op|
|[…]||drÿ||woonende op t’ gezegde Langemarck, welke|
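Sorting by right-hand context can be imitated by using the text to the right of the token as the sort key. The sketch below uses shortened versions of the rows above and Python's default string comparison, which compares characters by code point. The collation CoBaLT actually uses is not documented here and may treat accented characters differently. The function name and the row layout are hypothetical.

```python
def sort_by_right_context(rows, descending=False):
    """Sort (left context, token, right context) rows by right context."""
    return sorted(rows, key=lambda row: row[2], reverse=descending)


rows = [
    ("[…]", "drÿ", "en vÿftig jaeren"),
    ("[…]", "drÿ", "ander hoopen"),
    ("[…]", "drÿ", "woonende op"),
    ("[…]", "drÿ", "andere hoopen waeren"),
    ("[…]", "drÿ", "en twintig Wÿ"),
]
for left, token, right in sort_by_right_context(rows):
    print(token, right)
```

Sorting by left context works the same way with a key derived from the left context instead.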
- Sort rows by validation
Rows can be sorted by validation by clicking on one of the checkbox buttons next to the ‘sort by left context’ buttons. This way all validated rows are grouped at the top/bottom, so it is e.g. easy to see what has been done and still needs to be done.
- Sort rows by lemma
At the right-hand side, there are two buttons for sorting by lemma. Clicking on these buttons results in the rows being sorted by their analyses.
Again, for the sample screenshot this would amount to:
- Back to corpus page/start page
At the right-hand side of the top bar there are two buttons. One, labelled Corpora, for going back to the available corpora in the current database. The other one, labelled Start page, gets you back to the start screen where a database can be selected.
- Working with OCR’ed material
CoBaLT is a word form based tool. Word forms are grouped together and can be processed together. It is not always the case though that the word forms themselves are definite. Especially when working with OCR’ed material it could be the case that an OCR error occurred and that a particular word form was mistaken for another.
If the right data is available, the word form can be viewed in the tool in the original image it was scanned from. This way it is very easy, with one mouse click, to see whether or not the word was recognised correctly.
- Viewing a word form in the original image
If the right data is provided in the database some extra icons will show in the tool.
Clicking on a ‘view image’ icon in the document title bar will open a new window with the image. If an icon in a word form row is clicked on, the same image will load and the word form in question will be highlighted.
Note that the border/highlight can be removed by pressing a key (any key will do). Releasing the key will restore the highlighting again. This can be handy because the highlighting border sometimes covers surrounding characters.
- Overview of keys and mouse clicks
Below is a table that lists key and mouse behaviour in the tool. Please note that if you hold the mouse cursor over any clickable item in the tool a short message will be displayed about its function.
|Select multiple rows||Click and drag||Moving the mouse cursor over any rows in the lower part of the screen (word forms in their contexts) while holding down the mouse button will result in all these rows being selected. Please NOTE that if you do this very quickly, some rows might be left out of the selection, so go slow (enough).|
|Add to/subtract from selection||Ctrl-click||Clicking on a row in the lower part of the screen while holding down the Ctrl key will cause this row to be added to/subtracted from any existing selection (while if you do not hold down the Ctrl key you will start making a new selection).|
|Select all rows||Double click||Double clicking in the lower part of the screen will result in all rows being selected.|
|Add analysis||Click form||Clicking on a word form in context gives the available analyses in a drop-down-menu, or the option to type a new one.|
|Delete analysis||Click analysis||Clicking on an analysis in the lower part will result in this analysis being deleted for this row and for any other rows selected.|
|Toggle validation||F8||Pressing the F8 button will cause any selected rows to be validated/unvalidated.|
|Make word form group||F2/F9||Clicking on a word form in the context of another word form while holding down F2 or F9 will result in a word form group.|
|Go to middle part||Ctrl-m||Pressing the ‘m’ button while holding down the Ctrl key will put the focus on the middle part (so, e.g. the up/down arrows will select the row above/below again).|
|Middle part||Previous/next word form||↑ ↓||If the focus is on the middle part, the arrow keys will select the previous/next row.|
|Add analysis||Click row||Clicking in the left-hand part of a row in the middle part gives the available analyses in a drop-down-menu, or the option to type a new one.|
|Add analysis to forms in lower part||Click analysis||Clicking on an analysis in the right-hand column of the middle part will result in this analysis being assigned to all selected rows in the lower part.|
|Delete analysis||Ctrl-click||Clicking on an analysis in the right-hand column of the middle part while holding down the Ctrl key will cause this analysis to be deleted for the word form.|
|Toggle validation||Shift-click||Clicking on an analysis in the right-hand column in the middle part while holding down the Shift key will (un)validate it.|
|Drop-down menu||Browse through menu||↑ ↓||When a drop-down menu is active, the arrow keys allow you to browse through it. Hit ‘Enter’ to select the highlighted option.|
NOTE: pay attention to where you click!
Clicking on a row in the middle or lower part will select this row. However, as described above, clicking on an analysis in the middle part causes it to be applied, and clicking on an analysis in the lower part causes it to be deleted. So, if you just want to select a row, you will usually want to avoid doing so by clicking on any analysis it contains. You would better click anywhere else in the row.
Another detail to pay attention to is that some behaviour can make it appear to the tool as if the Ctrl key is held down when it is not. This situation is particularly perilous if you click on an analysis in the middle part in order to assign it to some rows below, because it will be deleted instead!
There is an indicator next to the ‘sort by left context’ buttons in the middle of the screen that shows whether or not the tool thinks the Ctrl key is held down (to see it in action, start the tool and press/release the Ctrl key).