Digitisation competitions & benchmarking

A registry of competitions and benchmarking relevant to digitisation and related fields.

Competition/benchmarking name | Category | Description | Organiser | URL
ICDAR2019 Competition on Post-OCR Text Correction | Optical Character Recognition, Post-Correction | The accuracy of Optical Character Recognition (OCR) technologies considerably impacts the way digital documents are indexed, accessed and exploited. Over the last decades, OCR engines have improved constantly and can now return exploitable results on mainstream documents. In practice, however, digital libraries hold many transcriptions whose quality is below expectations. Ancient documents with challenging layouts and varying states of conservation, such as historical newspapers, still resist modern OCR engines. Moreover, resources digitized earlier and processed with outdated OCR engines are rarely re-sent through the latest state-of-the-art digitization pipeline, as priority is often given to the ever-growing mass of newly arriving documents. In this context, OCR post-correction approaches, applied either to previously digitized documents or to fresh, challenging documents, could strongly benefit digital libraries. | La Rochelle Université, L3i Laboratory, Bibliothèque nationale de France, NewsEye project | https://sites.google.com/view/icdar2019-postcorrectionocr
ICDAR (International Conference on Document Analysis and Recognition) | Layout Analysis | International conference that hosts competitions on document analysis and recognition | |
ScriptNet | | ScriptNet is a platform for competitions run under the READ project | READ |
ICFHR (International Conference on Frontiers in Handwriting Recognition) | Handwriting Recognition | International conference that hosts competitions on handwritten text recognition (HTR) | |
Kaggle | | Website hosting competitions, datasets, courses, etc. | | https://www.kaggle.com/
Denoising Dirty Documents | Image Enhancement | This competition challenges you to give documents with stains, faded sun spots, dog-eared pages, etc. a machine learning makeover. Given a dataset of images of scanned text that has seen better days, you're challenged to remove the noise. Improving the ease of document enhancement will help us get that rare mathematics book on our e-reader before the next beach vacation. | | https://www.kaggle.com/c/denoising-dirty-documents
Tradeshift Text Classification | Text Classification | In this competition, participants are asked to create and open-source an algorithm that correctly predicts the probability that a piece of text belongs to a given class. | | https://www.kaggle.com/c/tradeshift-text-classification
Greek Media Monitoring Multilabel Classification (WISE 2014) | Text Classification | This is a multi-label classification competition for articles from Greek printed media. The raw data comes from the scanning of print media, article segmentation, and optical character segmentation, and is therefore quite noisy. Each article is examined by a human annotator and assigned to one or more of the topics being monitored. | Organized by the media monitoring solutions company DataScouting, the media monitoring services company ENIMEROSI and the Department of Informatics of the Aristotle University of Thessaloniki. It is the challenge accompanying the 15th International Conference on Web Information System Engineering (WISE 2014), held in Thessaloniki, Greece, on 12-14 October 2014. | https://www.kaggle.com/c/wise-2014
Optical Character Recognition (OCR) Feasibility Challenge | Optical Character Recognition | This project seeks a solution to our client's challenge of automating handwriting and text recognition on scanned documents with 95%+ accuracy. Current volume is approximately 5,000/month, but it may increase with an accurate solution. | TopCoder | https://www.topcoder.com/challenges/30052125/?type=develop&tab=details
ASAR 2018 Layout Analysis Competition Challenge: Physical layout analysis of scanned Arabic books | Layout Analysis | This competition provides: (1) a benchmarking dataset for testing physical layout analysis solutions, containing an annotated test set of scanned Arabic book page samples with a wide variety of content and appearance, and (2) a full evaluation scheme, offering code to compute a set of evaluation metrics for both analysis tasks (segmentation and classification) for quantitative evaluation, and to visually assess the analysis results for qualitative evaluation. | Organized by a joint team from Boston University (BU), USA, and the Electronics Research Institute (ERI), Egypt: Randa Elanwar, Researcher, ERI (randa.elanwar@eri.sci.eg), for queries and suggestions; Margrit Betke, Professor, BU (betke@bu.edu); Rana S.M. Saad, Research Assistant, ERI (rana@eri.sci.eg), for questions or problems with annotations; Wenda Qin, M.S. student, BU (wdqin@bu.edu), for questions or problems with the competition web page. |
Benchmarking of Document Image Analysis Tasks for Palm Leaf Manuscripts from Southeast Asia | Image Analysis | Paper in the Journal of Imaging | Made Windu Antara Kesiman (1, 2), Dona Valy (3, 4), Jean-Christophe Burie (1), Erick Paulus (5), Mira Suryani (5), Setiawan Hadi (5), Michel Verleysen (2), Sophea Chhun (4) and Jean-Marc Ogier (1). Affiliations: (1) Laboratoire Informatique Image Interaction (L3i), Université de La Rochelle, 17042 La Rochelle, France; (2) Laboratory of Cultural Informatics (LCI), Universitas Pendidikan Ganesha, Singaraja, Bali 81116, Indonesia; (3) Institute of Information and Communication Technologies, Electronic, and Applied Mathematics (ICTEAM), Université Catholique de Louvain, 1348 Louvain-la-Neuve, Belgium; (4) Department of Information and Communication Engineering, Institute of Technology of Cambodia, Phnom Penh, Cambodia; (5) Department of Computer Science, Universitas Padjadjaran, Bandung 45363, Indonesia. |
MediaEval Benchmarking Initiative for Multimedia Evaluation | Information Retrieval | MediaEval is a benchmarking initiative dedicated to developing and evaluating new algorithms and technologies for multimedia retrieval, access and exploration. It offers tasks to the research community that are related to human and social aspects of multimedia. MediaEval emphasizes the 'multi' in multimedia and seeks tasks involving multiple modalities, e.g., audio, visual, textual, and/or contextual. Our larger aim is to promote reproducible research that makes multimedia a positive force for society. | | http://www.multimediaeval.org/mediaeval2018/
TC11 committee of IAPR | Image Analysis | Benchmarking focused on, but not restricted to, datasets | | http://tc11.cvc.uab.es/datasets/type/