Input image formats
According to the manual page, most image file formats (anything readable by the Leptonica image processing library) are supported. The Leptonica project page (http://code.google.com/p/leptonica/) lists at least jpg, png, tiff, bmp, pnm, gif, ps, pdf and webp.
Currently supported languages for version 3.02 are: Afrikaans, Albanian, Arabic, Azerbaijani, Basque, Belarusian, Bengali, Bulgarian, Catalan, Cherokee, Chinese (Simplified), Chinese (Traditional), Croatian, Czech, Danish, Dutch, Esperanto, Estonian, Finnish, Frankish, French, Galician, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Italian (Old), Japanese, Kannada, Korean, Latvian, Lithuanian, Macedonian, Malay, Malayalam, Maltese, Middle English (1100-1500), Middle French (ca. 1400-1600), Norwegian, Polish, Portuguese, Romanian, Serbian (Latin), Slovakian, Slovenian, Spanish, Spanish (Old), Swahili, Swedish, Tagalog, Tamil, Telugu.
Language data can be downloaded at code.google.com/p/tesseract-ocr/downloads/list. The uncompressed trained data should be copied to the TESSDATA directory.
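Before running a recognition job, it can help to check that the trained data for each language you need is actually present. The following is a minimal sketch, assuming the data files follow the LANG.traineddata naming used in this article; the function name and the example directory are hypothetical, and the location of the data directory depends on your installation.

```python
from pathlib import Path


def missing_languages(tessdata_dir, languages):
    """Return the language codes with no <code>.traineddata file in tessdata_dir."""
    data_dir = Path(tessdata_dir)
    return [
        code
        for code in languages
        if not (data_dir / f"{code}.traineddata").is_file()
    ]


if __name__ == "__main__":
    # Hypothetical data directory; adjust to your own installation.
    print(missing_languages("/usr/local/share/tessdata", ["eng", "deu"]))
```

An empty list means that every requested language has a trained data file in that directory.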
Tesseract 3.02 can recognize images containing text in more than one language. Users can specify several languages, and Tesseract will use the most accurate recognition as the result. Keep in mind that recognizing pages with several languages takes much longer than with a single language profile.
The fact that your image format is supported and your language is implemented does not necessarily mean that your recognition results will be satisfactory. The main reasons for suboptimal results are:
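On the command line, several languages are passed to the -l option joined by plus characters (for example -l eng+deu). The sketch below only builds the argument list; the helper name and the file names are hypothetical, and it assumes that a tesseract executable and the corresponding trained data are installed.

```python
def build_tesseract_command(image_path, output_base, languages):
    """Build a tesseract argument list; language codes are joined with '+'."""
    if not languages:
        raise ValueError("at least one language code is required")
    return ["tesseract", image_path, output_base, "-l", "+".join(languages)]


if __name__ == "__main__":
    # Hypothetical input image and output base name.
    command = build_tesseract_command("page.tif", "page", ["eng", "deu"])
    print(" ".join(command))
    # To run it for real: subprocess.run(command, check=True)
```

Keeping the command as a list rather than a single string avoids shell quoting problems when file names contain spaces.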
- Poor quality images, for instance low-resolution black and white images from old microfilms
- Degraded documents (warped, unclear printing, damaged, …)
- Font shapes unknown to the engine
- Language mismatch: your language may be listed as supported, yet the language actually used in your documents may not match the implemented language support, for example if it contains specialized terminology or historical or regional language.
Extending language support
One of the peculiarities of Tesseract is that glyph shape training data and language support data are tied together: compiled word lists are part of the trained data bundle. A limited number of words can be added as a user word list without building a new data package.
Otherwise, one has to retrain the engine (cf. relevant section). A workaround for the entanglement of language and font data is as follows. Put the trained data file for your language in a separate directory and change to that directory. Assume the trained data file you start from is LANG.traineddata.
- Unpack the trained data: combine_tessdata -u LANG.traineddata LANG. (the trailing dot is part of the output prefix)
- Compile a word list to dawg format: wordlist2dawg your_word_list new_dawg_file LANG.unicharset
- Replace the word dawg: cp new_dawg_file LANG.word-dawg
- Repack the trained data: combine_tessdata LANG. (again including the trailing dot)
- Install your file LANG.traineddata by copying it to the Tesseract data directory.
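The steps above can be sketched as a small script. This is an illustration under stated assumptions, not a definitive tool: the function names are hypothetical, the commands are taken directly from the list above, and the final install step is left out because the location of the data directory depends on your installation.

```python
import subprocess


def dawg_replacement_commands(lang, word_list, new_dawg):
    """Return the unpack, compile, replace and repack commands, in order."""
    prefix = f"{lang}."  # the trailing dot is part of the prefix
    return [
        ["combine_tessdata", "-u", f"{lang}.traineddata", prefix],
        ["wordlist2dawg", word_list, new_dawg, f"{lang}.unicharset"],
        ["cp", new_dawg, f"{lang}.word-dawg"],
        ["combine_tessdata", prefix],
    ]


def run_all(commands, cwd):
    """Run each command inside cwd and stop at the first failure."""
    for command in commands:
        subprocess.run(command, cwd=cwd, check=True)


if __name__ == "__main__":
    # Hypothetical language code and file names, printed for review only.
    for command in dawg_replacement_commands("eng", "words.txt", "new.dawg"):
        print(" ".join(command))
```

Calling run_all on the returned list, with cwd set to the separate working directory, executes the workflow; printing the commands first is a cheap way to review them before anything is overwritten.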