
Abstract


BlackLab is a corpus retrieval engine built on top of Apache Lucene. It allows fast, complex searches with accurate hit highlighting on large bodies of tagged and annotated text. It was developed at the Institute of Dutch Lexicology (INL) to provide a fast and feature-rich search interface to our historical and contemporary text corpora.

List of features

BlackLab features include:

  • Complex fields. Complex fields are search fields that can have multiple searchable properties for each token. Properties may include word form, headword, part of speech, etc. They may be combined when searching, so you can search for all verb forms starting with “a”.
  • Regular-expression-like patterns over tokens. For example, you can search for one or more adjectives, followed by the word “cow”, followed within 3 words by a form of the verb “to walk”.
  • Search within XML tags occurring in a text. For example, if your text is tagged with <ne></ne> tags around named entities (people, organisations, locations), BlackLab allows you to search for named entities occurring in the text that contain the word “city”.
  • Fast grouping and sorting of large result sets on several criteria, including context (hit text, left context of hit, right context of hit). For example, you can group results by the word occurring to the right of the matched word(s).
  • Accurate highlighting of hits in a document and fast KWIC (keyword in context) view of hits.
  • Random sampling of results, to get a representative sample from a large result set. [planned]
  • Fast and easy indexing of large XML datasets in several well-known formats, with the option to add support for your own format.
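
To make the token-pattern and grouping features above more concrete, here is a minimal Python sketch of the idea. It is not BlackLab's API or implementation: the token dictionaries, the property names (`word`, `lemma`, `pos`), the tag values and the helper functions are all assumptions made for illustration. It scans a tiny in-memory token list rather than an index, looks for one or more adjectives followed by “cow” and then a form of “to walk” within 3 words, and groups the hits by the word to the right of each hit.

```python
from collections import Counter

# Each token carries several properties, mirroring the "complex field" idea.
# Property names and tag values are hypothetical, chosen for this sketch only.
def tok(word, lemma, pos):
    return {"word": word, "lemma": lemma, "pos": pos}

TOKENS = [
    tok("the", "the", "DET"),
    tok("big", "big", "ADJ"),
    tok("brown", "brown", "ADJ"),
    tok("cow", "cow", "NOU"),
    tok("slowly", "slowly", "ADV"),
    tok("walked", "walk", "VRB"),
    tok("home", "home", "NOU"),
    tok("an", "a", "DET"),
    tok("old", "old", "ADJ"),
    tok("cow", "cow", "NOU"),
    tok("walks", "walk", "VRB"),
    tok("away", "away", "ADV"),
]

def find_hits(tokens, max_gap=3):
    """Return (start, end) spans, end exclusive, for:
    one or more adjectives, the word "cow", then a form of "walk"
    among the next max_gap tokens. Only the longest adjective run is kept."""
    hits = []
    for i, t in enumerate(tokens):
        if t["word"] != "cow":
            continue
        # Extend left over the run of adjectives directly before "cow".
        start = i
        while start > 0 and tokens[start - 1]["pos"] == "ADJ":
            start -= 1
        if start == i:
            continue  # no adjective before "cow"
        # Look for a form of "walk" (matched on lemma) within max_gap words.
        for j in range(i + 1, min(i + 1 + max_gap, len(tokens))):
            if tokens[j]["lemma"] == "walk":
                hits.append((start, j + 1))
                break
    return hits

def group_by_right_word(tokens, hits):
    """Count hits by the first word to the right of each hit."""
    return Counter(tokens[end]["word"] for _, end in hits if end < len(tokens))

hits = find_hits(TOKENS)
for start, end in hits:
    print(" ".join(t["word"] for t in TOKENS[start:end]))
print(group_by_right_word(TOKENS, hits))
```

A linear scan like this only illustrates what such a query means. An engine built on an index would resolve the same pattern without reading every token.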

Try BlackLab online


Corpus Gysseling, a small corpus of historic Dutch (1200-1300), is our first publicly available application (still in beta) using BlackLab: http://tinyurl.com/gysseling

This simple application showcases some features of BlackLab. Here are a few searches you can try:

  • Lemma: “koe”. Finds all forms of the word “koe” (cow). Other words to try: “wet” (law), “zien” (to see), “groot” (large).
  • POS: “NOU*”. Finds all nouns. Other values to try: “VRB*” (verbs), “ADJ*” (adjectives).
  • Word form: “coe”. Finds a specific historic spelling of “koe”.

You can also change the operator used to combine search clauses (And/Or), or add filter settings in the box below the search fields (click ‘Show’).

Please note that this is just a small sample of the capabilities of BlackLab.
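
The example searches above, and the And/Or operator that combines them, can be sketched in a few lines of Python. This is only an illustration of the matching logic under stated assumptions, not the code behind the demo application: the `search` function, the property names, the sample tokens and their tags are hypothetical, and the `*` wildcard is interpreted here with Python's `fnmatch` rules.

```python
from fnmatch import fnmatchcase

# Invented sample data: each token has a word form, a lemma and a POS tag.
TOKENS = [
    {"word": "coe", "lemma": "koe", "pos": "NOU-C"},
    {"word": "koe", "lemma": "koe", "pos": "NOU-C"},
    {"word": "siet", "lemma": "zien", "pos": "VRB"},
    {"word": "groet", "lemma": "groot", "pos": "ADJ"},
]

def search(tokens, clauses, operator="and"):
    """Return the word forms of tokens matching the clauses.

    clauses maps a property name to a wildcard pattern, e.g. {"pos": "NOU*"}.
    operator is "and" (all clauses must match) or "or" (any clause may match).
    """
    combine = all if operator == "and" else any
    return [
        t["word"]
        for t in tokens
        if combine(fnmatchcase(t[prop], pat) for prop, pat in clauses.items())
    ]

print(search(TOKENS, {"lemma": "koe"}))                        # all forms of "koe"
print(search(TOKENS, {"pos": "NOU*"}))                         # all nouns
print(search(TOKENS, {"lemma": "zien", "pos": "ADJ*"}, "or"))  # either clause
```

Switching the operator from "or" to "and" in the last call returns no results for this sample, because no token has both the lemma “zien” and an adjective tag.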

More information

Availability

BlackLab is licensed under the Apache License 2.0. Find more information at https://github.com/INL/BlackLab

OCR Post-correction and Enrichment