Diachronic search

Rafael Carrasco | News

To demonstrate the potential applications of the Impact ground-truth collection, we have implemented a new online service that lets users run linguistically enriched queries on a collection of historical texts. The interface is available at http://data.cervantesvirtual.com/blog/diasearch/ and currently covers the 86 Spanish texts in the Impact-BVC corpus.

The advanced search supports queries whose search terms can be a combination of surface forms, lemmata, parts of speech and historical variants. The input form consists of a simple text box where multiple query terms can be specified.

Every term can be preceded by a prefix with the following syntax:

  • If no prefix is added, the term denotes a diachronic form (verbatim text).
  • The prefix modern# denotes a modern form.
  • The prefix lemma# is followed by a lemma.
  • The prefix pos# denotes a part-of-speech tag.
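The prefix rules above can be illustrated with a minimal sketch. It is not the code behind the service: the function name parse_query and the layer labels ("form", "modern", "lemma", "pos") are assumptions introduced here only to make the syntax concrete.

```python
from typing import List, Tuple

# Prefixes from the list above. The layer labels are hypothetical names
# chosen for this sketch, not identifiers used by the actual service.
KNOWN_PREFIXES = {"modern", "lemma", "pos"}
DEFAULT_LAYER = "form"  # no prefix: diachronic (verbatim) form


def parse_query(query: str) -> List[Tuple[str, str]]:
    """Split a query into (layer, value) pairs, one per whitespace-separated term."""
    terms = []
    for token in query.split():
        prefix, sep, value = token.partition("#")
        if sep and prefix in KNOWN_PREFIXES and value:
            terms.append((prefix, value))
        else:
            # No recognised prefix: treat the whole token as a verbatim form.
            terms.append((DEFAULT_LAYER, token))
    return terms


parsed = parse_query("lemma#haber modern#de pos#verb")
```

Here parsed holds one pair per term, so a term without a prefix, such as "aura", would be tagged with the default "form" layer.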

Terms with different prefixes can be combined in a single query, and they can also be combined with the rich query syntax provided by Lucene. For example, the query "lemma#haber modern#de pos#verb" returns occurrences such as "ha de vencer" or "aura de tornar".
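To see why both occurrences match that query, the following sketch checks a query against a small hand-annotated token sequence. It assumes that the terms must match consecutive tokens, which the example results suggest but the text does not state. The token annotations, the tags "prep" and "conj", and the function find_matches are illustrative assumptions, not data or code from the service.

```python
from typing import Dict, List, Tuple

Token = Dict[str, str]

# Hand-made toy annotations (illustrative only, not taken from the corpus).
tokens: List[Token] = [
    {"form": "ha", "modern": "ha", "lemma": "haber", "pos": "verb"},
    {"form": "de", "modern": "de", "lemma": "de", "pos": "prep"},
    {"form": "vencer", "modern": "vencer", "lemma": "vencer", "pos": "verb"},
    {"form": "y", "modern": "y", "lemma": "y", "pos": "conj"},
    {"form": "aura", "modern": "habrá", "lemma": "haber", "pos": "verb"},
    {"form": "de", "modern": "de", "lemma": "de", "pos": "prep"},
    {"form": "tornar", "modern": "tornar", "lemma": "tornar", "pos": "verb"},
]

# The example query, already split into (layer, value) pairs.
query: List[Tuple[str, str]] = [("lemma", "haber"), ("modern", "de"), ("pos", "verb")]


def find_matches(tokens: List[Token], query: List[Tuple[str, str]]) -> List[str]:
    """Return the surface text of every run of consecutive tokens matching the query."""
    n = len(query)
    if n == 0:
        return []
    hits = []
    for start in range(len(tokens) - n + 1):
        window = tokens[start:start + n]
        # Each query term is compared with the corresponding layer of its token.
        if all(tok.get(layer) == value for tok, (layer, value) in zip(window, query)):
            hits.append(" ".join(tok["form"] for tok in window))
    return hits


matches = find_matches(tokens, query)
```

The historical form "aura" matches because its lemma is "haber", even though its surface form differs from the modern "habrá"; this is the kind of hit a plain verbatim search would miss.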

The method can be easily extended to other languages and documents in the Impact data sets.

This demonstrator provides an additional example of the services the Impact Centre of Competence can deliver.