Named Entity Work in IMPACT: Frank Landsbergen
Frank began by defining named entities (NE). They are a word or string referring to a proper location, person or organisation (or date, time, etc). Within IMPACT the term is limited to location, person or organisation. The extra focus on these words is primarily because they may be of particular interest to end users and because they are usually not in dictionaries, so there is more improvement in the lexicon and ultimately the OCR. Note that a lexicon is a list of related entities in the database that are linked e.g. Amsterdams, Amsteldam, Amsteldamme = Amsterdam.
He then spent the majority of his talk walking us through the 4 step process of building a NE Lexicon:
- Data collection
- NE tagging – Possibilities include NE extraction software, use of the Stanford University module, statistical NE recognition (the software ‘trains’ itself) or manual tagging. Many of these tools currently work best with contemporary data.
- enrichment (POS tagging, lemmatizing, adding the person name structure, variants)
- database creation
So far the majority of IMPACT’s NE work has been on creating a toolkit for lexicon building (NERT, Attestation tool) and creating NE-lexica for Dutch, English and German.
Special Resources to Access 16th Century Germany: Annette Gotscharek
The 16th Century German printed book collection was a special case because the resources were so old, and therefore very challenging. There were a number of special features of the historical language on a word-level. The historical variants were all added to the modern lemma.
The diachronic ground truth corpus was text files of what appeared on the scans. It was collected from different resources on the Web and non-public electronic corpora. They worked on areas including creation of a hypothetical lexicon and manually verifying IR-lexicon.
Polish Language Resources in IMPACT: Janusz S. Bień
Janusz and his team faced a number of challenges when working with Polish text. They did not use the oldest Polish dictionary but focused on later historical dictionaries and sources used by these dictionaries. The earlier dictionaries were rejected because they had relevant information but it was too difficult to extract. They also struggled to use other later dictionaries because of copyright issues. In the end they did manage to use a selection of dictionaries and texts including the Benedykt Chmielowski encyclopedia, which is famous for its memorable “definitions”: “Horse is as everyone can see.”
They looked at a number of tools including the lemmatization in INL Lexicon tool (SAM, SpXViiw). More information is available at http://bc.klf.uw.edu.pl
Slovene Language Resources in IMPACT: Tomaž Erjavec
Tomaž previously worked on the AHLib project looking at transcription correction and markup. At the same time as their work on the IMPACT project they also won a Google award so have been able to develop language models for historical Slovene.
Their methodology has been to develop 3 resources: transcribed texts, hand-annotated corpus and a lexicon of historical words. They have also developed the annotation tool, ToTrTaLe which aids in tagging and lemmatising historical Slovene. The main issues have been tokenisation (words were split differently in historical language), variability and extinct words. During the project they have transcribed over 10 million words, these comprise of the AHLib corpus/DL, NUK GTD, Google Books and Wiki source –all are freely available.
A video of the session is here: