Old characters in gothic texts

I am trying to evaluate OCR texts produced by ABBYY Recognition Server. The original texts contain some old characters that are no longer in use, for example the “long s”: http://en.wikipedia.org/wiki/Long_s. The server recognizes it as a normal s. However, our ground truth files have the correct “long s” in them. Do you have any tips on how to treat such characters? I am not aware of a switch in the ABBYY server to produce those old characters.

5 thoughts on “Old characters in gothic texts”

  1. What we do for evaluation is to ‘normalise’ both the ground truth and the FineReader result text. That is, we replace every long s, every ligature, and some other characters with their simple versions (the normal s in your case). Ligatures (combined characters) are replaced by their multi-character counterparts.

  2. As far as I know, FineReader recognises “long s” (ſ) as “normal s” (s). We tested different settings for Dutch and always got the same output. As stated by Christian, for Spanish we normalize the ground truth before comparing it with the OCR output.

  3. Thank you for your replies.
    My follow-up question is: Do you use special tools for the normalization? I imagine that you just have a mapping table of the old characters and the respective replacements. Then you use a simple find-and-replace loop with e.g. sed in a bash script. Is that correct?
    I haven’t tried it yet, but I know that this evaluation tool (https://sites.google.com/site/textdigitisation/ocrevaluation/additional-notes) accepts an “equivalences file” as input. Do you have any experience with that? It is probably good for single characters that should be treated as equivalent, but not for combined ligatures?

  4. Hi,

    We use the User Pattern Training Utility. With it we can train the engine to recognize special characters like the long s. Of course, it also works for other characters the engine doesn’t recognize. You find the tool in C:\ProgramData\ABBYY\SDK\11\FineReader Engine\Samples\Demo Tools\User Pattern Training Utility
    You need to include the output of the training in the program you use for OCR. We use the CommandLineInterface in the sample folder.
    Some quick and dirty code we use ;-)
    // Collect the pattern files produced by the training utility.
    IStringsCollection *patternFiles;
    engine->CreateStringsCollection(&patternFiles);
    patternFiles->Add(L"c:\\patternFiles\\outputfile1.ptn");
    patternFiles->Add(L"c:\\patternFiles\\outputfile2.ptn");
    // Merge them into a single file and point the recognizer at it.
    engine->MergePatterns(patternFiles, L"c:\\merged.ptn");
    recognizerParams->put_UserPatternsFile(L"c:\\merged.ptn");

    If you really want to process gothic characters you apparently need an extra license.

    Sam
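
For readers who want to try the normalisation route from the first replies, here is a minimal sketch of the mapping-table idea in C++. It assumes that both the ground truth and the OCR output are UTF-8 strings. The table contents and the names `replaceAll`, `equivalences` and `normalizeForEvaluation` are made up for illustration; they are not the list or the tools the commenters actually use.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Replace every occurrence of `from` in `text` with `to`.
static void replaceAll(std::string& text, const std::string& from,
                       const std::string& to) {
    assert(!from.empty());  // an empty pattern would never advance
    std::size_t pos = 0;
    while ((pos = text.find(from, pos)) != std::string::npos) {
        text.replace(pos, from.size(), to);
        pos += to.size();  // continue after the inserted replacement
    }
}

// Hypothetical mapping table: old characters and ligatures (as UTF-8 bytes)
// paired with their simple, possibly multi-character, counterparts.
static const std::vector<std::pair<std::string, std::string>>& equivalences() {
    static const std::vector<std::pair<std::string, std::string>> table = {
        {"\xC5\xBF", "s"},        // U+017F long s
        {"\xEF\xAC\x80", "ff"},   // U+FB00 ligature ff
        {"\xEF\xAC\x81", "fi"},   // U+FB01 ligature fi
        {"\xEF\xAC\x82", "fl"},   // U+FB02 ligature fl
        {"\xEF\xAC\x83", "ffi"},  // U+FB03 ligature ffi
        {"\xEF\xAC\x84", "ffl"},  // U+FB04 ligature ffl
        {"\xEF\xAC\x85", "st"},   // U+FB05 ligature long s + t
        {"\xEF\xAC\x86", "st"},   // U+FB06 ligature st
    };
    return table;
}

// Apply the whole table; run this on ground truth AND OCR text before comparing.
std::string normalizeForEvaluation(std::string text) {
    for (const auto& entry : equivalences()) {
        replaceAll(text, entry.first, entry.second);
    }
    return text;
}
```

Plain byte-wise search is enough here because a complete UTF-8 sequence cannot match in the middle of another character. Running the same function over both texts means a recognised “s” and a ground truth “ſ” no longer count as an error, and multi-character ligature replacements are handled the same way as single characters.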
