Evaluation - Classification of Medieval Handwritings in Latin Script

The evaluation will be done using a leave-one-image-out cross-validation approach. This means that every image of the test set will be used as a query, for which all the other test images have to be ranked. Additionally, participants have to provide, for every query, an estimate of the number of relevant samples.

Data evaluation

In this track, participants will be provided with a test set. They will have to perform a traditional leave-one-out evaluation and will be asked to report, for each query document, a ranking of all the other test-set samples. For each query, participants should also report an estimate of how many documents in the database match the query. The results will be evaluated using the metrics described in Section 4.3. The organizers of the competition will make the toolkit for performance evaluation publicly available on the 28th of February 2019, along with the release of the validation dataset.
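
The leave-one-out ranking described above can be sketched as follows. This is only an illustration, not the official toolkit: it assumes the submission is a square distance matrix (as mentioned under Error Metrics) in which smaller values mean more similar samples, and the function name rank_for_queries is hypothetical.

```python
import numpy as np


def rank_for_queries(dist):
    """For each query (row), return the indices of all OTHER samples,
    sorted from most similar to least similar.

    Assumption: `dist` is an n x n matrix of distances, smaller = closer.
    """
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    if dist.shape != (n, n):
        raise ValueError("expected a square distance matrix")
    # Sort every row by increasing distance; a stable sort keeps ties
    # in index order so the result is deterministic.
    order = np.argsort(dist, axis=1, kind="stable")
    # Leave-one-out: drop the query itself from its own ranking.
    not_self = order != np.arange(n)[:, None]
    return order[not_self].reshape(n, n - 1)


if __name__ == "__main__":
    toy = [[0.0, 2.0, 1.0],
           [2.0, 0.0, 3.0],
           [1.0, 3.0, 0.0]]
    print(rank_for_queries(toy))
```

Each row of the result is the ranking for one query image; the query itself never appears in its own row.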

Error Metrics

Our test set is unbalanced class-wise: a query can have anywhere from
four samples of the same class to be retrieved down to none at all. Two
different metrics will be evaluated and reported.
The first metric is the mean Average Precision (mAP), which will be
computed from the distance matrix provided by the participants.
The second metric is a classification F-Score, which is the harmonic mean
of Precision and Recall.
To compute this second metric, the aforementioned distance matrix will
be used in combination with a single integer per query: an estimate of
how many relevant samples exist for that query.
If no relevance estimates are provided, four (the greatest possible
number of relevant samples) will be used for all queries.
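
A rough sketch of how the two metrics could be computed from a distance matrix is given below. It is not the official implementation (see the links under Evaluation system for that), and it rests on several assumptions: smaller distances mean higher similarity, queries with no relevant sample are skipped when averaging for mAP, and the F-Score is micro-averaged over all queries. The function names and the labels argument are hypothetical.

```python
import numpy as np


def _relevance_by_rank(dist, labels):
    """Boolean matrix: entry (q, r) is True if the sample at rank r for
    query q has the same class as q (the query itself is left out)."""
    dist = np.asarray(dist, dtype=float)
    labels = np.asarray(labels)
    n = len(labels)
    order = np.argsort(dist, axis=1, kind="stable")
    order = order[order != np.arange(n)[:, None]].reshape(n, n - 1)
    return labels[order] == labels[:, None]


def mean_average_precision(dist, labels):
    """mAP over all queries that have at least one relevant sample."""
    average_precisions = []
    for row in _relevance_by_rank(dist, labels):
        n_rel = int(row.sum())
        if n_rel == 0:
            continue  # assumption: nothing to retrieve, query is skipped
        hit_ranks = np.flatnonzero(row) + 1  # 1-based ranks of the hits
        precisions = np.arange(1, n_rel + 1) / hit_ranks
        average_precisions.append(precisions.mean())
    return float(np.mean(average_precisions)) if average_precisions else 0.0


def f_score(dist, labels, estimates=None, default_k=4):
    """Micro-averaged F-Score (an assumption of this sketch).

    For every query, the top-k ranked samples are treated as retrieved,
    where k is the participant's relevance estimate. Without estimates,
    `default_k` (four, per the text above) is used for all queries.
    """
    rel = _relevance_by_rank(dist, labels)
    n = rel.shape[0]
    if estimates is None:
        estimates = np.full(n, default_k)
    k = np.clip(np.asarray(estimates, dtype=int), 0, n - 1)
    true_pos = sum(int(row[:ki].sum()) for row, ki in zip(rel, k))
    retrieved = int(k.sum())
    relevant = int(rel.sum())
    precision = true_pos / retrieved if retrieved else 0.0
    recall = true_pos / relevant if relevant else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of Precision and Recall.
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy data: five samples on a line, three classes, one singleton class.
    pos = np.array([0.0, 3.0, 2.0, 10.0, 20.0])
    toy_dist = np.abs(pos[:, None] - pos[None, :])
    toy_labels = ["a", "a", "b", "b", "c"]
    print("mAP:", mean_average_precision(toy_dist, toy_labels))
    print("F-Score:", f_score(toy_dist, toy_labels, estimates=[2, 2, 3, 2, 0]))
```

In this sketch a well-calibrated relevance estimate matters: overestimating lowers Precision, underestimating lowers Recall, which is why falling back to four for every query is generally a conservative choice on an unbalanced test set.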

Evaluation system

For more details or questions, contact us or refer to the implementation
of the metrics:
https://github.com/anguelos/wi19_evaluate/blob/master/wi19/metrics.py
https://github.com/anguelos/wi19_evaluate/blob/master/test/test_metrics.py