In the late 1990s, Yann LeCun's team pioneered the successful application of machine learning to optical character recognition (OCR). Twenty-five years later, machine learning remains an invaluable tool for processing text downstream of the OCR step.
Tradeshift has created a dataset with thousands of documents, representing millions of words. In each document, several bounding boxes containing text are selected. For each piece of text, many features are extracted and certain labels are assigned.
In this competition, participants are asked to create and open-source an algorithm that correctly predicts the probability that a piece of text belongs to a given class.
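To make the task concrete, here is a minimal sketch of per-label probability prediction. It assumes each piece of text has already been reduced to a numeric feature vector and that each label is an independent yes/no target; the toy data, the function names (train_binary, predict_proba), and the choice of logistic regression are illustrative assumptions, not the competition's actual features or a recommended method.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_binary(X, y, epochs=500, lr=0.5):
    """Fit logistic-regression weights (bias stored last) by batch gradient descent."""
    n_features = len(X[0])
    w = [0.0] * (n_features + 1)
    for _ in range(epochs):
        grad = [0.0] * (n_features + 1)
        for xi, yi in zip(X, y):
            # zip stops at the shorter sequence, so the bias w[-1] is added separately.
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + w[-1])
            err = p - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj
            grad[-1] += err
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w


def predict_proba(models, x):
    """Return one probability per label for a single feature vector x."""
    return [sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + w[-1]) for w in models]


# Hypothetical toy data: 4 pieces of text, 2 features each, 2 labels each.
X = [[0.0, 1.0], [0.1, 0.9], [1.0, 0.0], [0.9, 0.2]]
Y = [[1, 0], [1, 0], [0, 1], [0, 1]]

# One independent binary model per label.
models = [train_binary(X, [row[k] for row in Y]) for k in range(len(Y[0]))]
probs = predict_proba(models, [0.05, 0.95])
```

Here probs holds one value between 0 and 1 per label for the new piece of text. Training one model per label is the simplest design; it ignores any correlation between labels, which a real entry might want to exploit.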