Training data set
For the training data, we suggest using the “ICDAR17 Historical-WI” test dataset. Training on additional data is not forbidden, for example the data of the ICFHR2016 or ICDAR2017 “Competition on the Classification of Medieval Handwriting in Latin Script”.
The organizers have prepared an additional corpus. It encompasses handwritten letters as well as book scripts of the Middle Ages and the 16th century. The corpus can be downloaded at the following link: https://faubox.rrze.uni-erlangen.de/dl/fiDbCEE4qLU1LYx1uS1sdNgt/wi_comp_19_validation.zip
It contains 300 writers contributing 1 page each, 100 writers contributing 3 pages each, and 120 writers contributing 5 pages each, resulting in 1,200 images from 520 writers.
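As a quick sanity check, the composition of the additional corpus can be recomputed from the per-writer page counts given above. This is only an illustrative sketch; the variable names are our own and are not part of any official tooling.

```python
# Number of writers for each "pages per writer" group (figures from the text).
writers_by_page_count = {1: 300, 3: 100, 5: 120}  # pages -> writers

# Total writers is the sum of the group sizes.
total_writers = sum(writers_by_page_count.values())

# Total images is pages-per-writer times writers, summed over the groups.
total_images = sum(pages * writers for pages, writers in writers_by_page_count.items())

print(total_writers, total_images)  # prints "520 1200"
```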
Test data set
The test data set contains 20,000 images, with a varying number of samples per writer (1 to 5). The corpus can be downloaded at the following link: https://faubox.rrze.uni-erlangen.de/dl/fiCb7MhhW8nyM9Vy8d9KUxZu/wi_comp_19_test.zip
Update (23/04/2019): the same corpus is available with higher-resolution images [25 Gb]: https://faubox.rrze.uni-erlangen.de/dl/fiJ9VmgHEnvxxz87eia3HYdK/wi_comp_19_test_v2.zip
The main focus of this new corpus is the writers of books in the European Middle Ages, especially from the 9th to the 15th century CE.
The larger part of the corpus is anonymous. Indeed, few of the writers of this period signed their work, and fewer still are known by name. For this part, given that paleographers’ attributions across books may be disputed, the organizers posit that consecutive pages in a homogeneous part of a book represent one particular writer.
A smaller part of the corpus is composed of script samples from books that are believed or demonstrably known to have been written by the same individual, such as literary autographs. For this subset, and for the sake of homogeneity in the competition, the corpus also comprises five consecutive pages of each of the selected autograph books.
In the test data set, most images are taken from IIIF-compliant repositories that allow the use of images for scientific and teaching purposes. The organizers crop each selected image to a randomly chosen text region, in varying sizes, so that participants cannot base the image retrieval on page layout or digitization protocols (color scale, ruler, etc.).
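The cropping idea described above can be illustrated with a minimal sketch. The organizers' actual procedure is not specified here, so everything below is an assumption: the function name random_crop_box, the size fractions, and the premise that a bounding box of the text region is already known.

```python
import random

def random_crop_box(region, min_frac=0.3, max_frac=0.8, rng=None):
    """Pick a random crop rectangle inside a text region.

    region is a hypothetical (left, top, right, bottom) pixel box of the
    text area; min_frac and max_frac are assumed bounds on the crop size
    relative to the region. Returns (left, top, right, bottom).
    """
    if rng is None:
        rng = random.Random()
    x0, y0, x1, y1 = region
    width, height = x1 - x0, y1 - y0

    # Draw the crop width and height independently so crop sizes vary.
    crop_w = max(1, int(width * rng.uniform(min_frac, max_frac)))
    crop_h = max(1, int(height * rng.uniform(min_frac, max_frac)))

    # Place the crop at a random offset that keeps it inside the region.
    left = x0 + rng.randint(0, width - crop_w)
    top = y0 + rng.randint(0, height - crop_h)
    return (left, top, left + crop_w, top + crop_h)

# Example: a text region of 1000 x 1500 pixels starting at (100, 200).
box = random_crop_box((100, 200, 1100, 1700), rng=random.Random(0))
print(box)
```

The returned tuple follows the (left, top, right, bottom) convention used by common imaging libraries, so it could be passed to a crop routine of one's choice.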
Image format
In both the training and the test data sets, the images come in different resolutions and formats (jpg, tif, grey-level, color, etc.). After the competition, the images will be published on Zenodo, linked from the CLAMM competition website http://clamm.irht.cnrs.fr, and sent to the IAPR-TC11.