Data set

Image data-set

Formats

The training data-set and the test data-sets for tasks 1 and 3 consist of grey-level images in TIFF format at 300 dpi, each picturing a 100 x 150 mm part of a manuscript.
The test data-set for tasks 2 and 4 consists of heterogeneously encoded data, that is to say, data in different formats (JPEG/TIFF), resolutions (300/400 dpi) and color representations (grey-level and color images). Moreover, these digital images were created from different sources: digital capture from primary sources; digitization of analog, 1:1 scale photographs; digitization of microfilms. In the last case, the technical resolution of 400 dpi does not equate to a 400 dpi resolution relative to the original primary source.
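
As a rough illustration of what the 300 dpi figure implies for the task 1 and 3 images, the sketch below converts the stated 100 x 150 mm area into approximate pixel dimensions. The function name mm_to_pixels is hypothetical, and the result is an estimate only: actual files may differ by a few pixels depending on how they were cropped.

```python
MM_PER_INCH = 25.4  # exact definition of the inch in millimetres


def mm_to_pixels(length_mm, dpi):
    """Convert a physical length in mm to a pixel count at a given dpi."""
    return round(length_mm / MM_PER_INCH * dpi)


# Stated in the text: a 100 x 150 mm manuscript area scanned at 300 dpi.
width_px = mm_to_pixels(100, 300)
height_px = mm_to_pixels(150, 300)
print(width_px, height_px)  # 1181 1772
```

The same conversion does not apply directly to the microfilm-derived images of tasks 2 and 4, since their nominal dpi refers to the scanned film rather than to the original manuscript.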

The list of classes is provided in a CSV file with 2 columns: “FILENAME,SCRIPT_TYPE” for tasks 1 and 2; “FILENAME,DATE_TYPE” for tasks 3 and 4.
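
A minimal sketch of how such a two-column file could be read with Python's standard csv module. It assumes, without confirmation from the text, that the file may start with a header row and that the class labels are integers; the function name read_labels and the sample file names are invented for illustration.

```python
import csv
import io


def read_labels(handle):
    """Return {filename: class_label} from a two-column CSV.

    Assumptions: a header row whose first cell is FILENAME may be
    present and is skipped; labels are integers.
    """
    labels = {}
    for row in csv.reader(handle):
        if not row or row[0].strip().upper() == "FILENAME":
            continue  # skip blank lines and the optional header
        labels[row[0].strip()] = int(row[1])
    return labels


# Hypothetical sample standing in for the real file.
sample = io.StringIO(
    "FILENAME,SCRIPT_TYPE\n"
    "img_0001.tif,11\n"
    "img_0002.tif,2\n"
)
labels = read_labels(sample)
print(labels)
```

For tasks 3 and 4 the same function would apply unchanged, since only the name of the second column differs (DATE_TYPE instead of SCRIPT_TYPE).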

Training and test sets

For tasks 1 and 2, the training set consists of the 3500 images used in the ICFHR 2016 Competition on the Classification of Medieval Handwritings in Latin Script, published at http://icfhr2016-clamm.irht.cnrs.fr/icfhr-2016/

For tasks 3 and 4, the training set consists of 3000 images selected from the previously described training set.

The test set consists of 2000 images for tasks 1 and 3, and 1000 images for tasks 2 and 4.

Additional information

The image collection used for training is mainly based on the collection of 9800 images from the French catalogues of dated and datable manuscripts [18]–[26], supplemented with the on-line documentation from the BVMM (http://bvmm.irht.cnrs.fr/) and Gallica (http://gallica.bnf.fr/) in order to build classes of the same size.
The image collections used for the competition are based on the collection of 9800 images from the French catalogues of dated and datable manuscripts for tasks 1 and 3, and on the on-line documentation from the BVMM (http://bvmm.irht.cnrs.fr/) and Gallica (http://gallica.bnf.fr/) for tasks 2 and 4.

Classes

Classes for style classification

The images of the training set are tagged with one of 12 labels. The division of scripts is based on morphological differences and allographs, as defined in standard works on Latin scripts [27], [28].
The labels of the 12 style classes are given in alphabetical order (1 = Caroline; 2 = Cursiva; 3 = Half-Uncial; 4 = Humanistic; 5 = Humanistic Cursive; 6 = Hybrida; 7 = Praegothica; 8 = Semihybrida; 9 = Semitextualis; 10 = Southern Textualis; 11 = Textualis; 12 = Uncial).
Cf. https://clamm.irht.cnrs.fr/script-classes/
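
For convenience, the twelve style labels listed above can be written as a lookup table. This is only a sketch: the names SCRIPT_CLASSES and script_name are hypothetical and not part of any official tooling.

```python
# Label numbers and names exactly as listed in the text.
SCRIPT_CLASSES = {
    1: "Caroline",
    2: "Cursiva",
    3: "Half-Uncial",
    4: "Humanistic",
    5: "Humanistic Cursive",
    6: "Hybrida",
    7: "Praegothica",
    8: "Semihybrida",
    9: "Semitextualis",
    10: "Southern Textualis",
    11: "Textualis",
    12: "Uncial",
}


def script_name(label):
    """Return the script name for a numeric label (raises KeyError if unknown)."""
    return SCRIPT_CLASSES[label]


print(script_name(11))  # Textualis
```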

Classes for manuscript dating

The images of the training set are tagged with one of 15 date labels.
The labels of the 15 date classes are given in chronological order of manuscript production:
1 = Date before the year 1000 C.E.;
2 = Date between 1001 and 1100 C.E.;
3 = Date between 1101 and 1200 C.E.;
4 = Date between 1201 and 1250 C.E.;
5 = Date between 1251 and 1300 C.E.;
6 = Date between 1301 and 1350 C.E.;
7 = Date between 1351 and 1400 C.E.;
8 = Date between 1401 and 1425 C.E.;
9 = Date between 1426 and 1450 C.E.;
10 = Date between 1451 and 1475 C.E.;
11 = Date between 1476 and 1500 C.E.;
12 = Date between 1501 and 1525 C.E.;
13 = Date between 1526 and 1550 C.E.;
14 = Date between 1551 and 1575 C.E.;
15 = Date between 1576 and 1600 C.E.
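
The date ranges above can be turned into a small lookup, sketched here under stated assumptions. The function name date_class is hypothetical. Because the list assigns no class to the year 1000 itself or to years after 1600, this sketch raises an error for those inputs rather than guessing.

```python
import bisect

# Inclusive upper bounds of classes 2 to 15, as listed in the text.
UPPER_BOUNDS = [1100, 1200, 1250, 1300, 1350, 1400, 1425,
                1450, 1475, 1500, 1525, 1550, 1575, 1600]


def date_class(year):
    """Map a production year (C.E.) to its date class label (1-15)."""
    if year < 1000:
        return 1
    if year == 1000 or year > 1600:
        # Assumption: years not covered by the list are rejected.
        raise ValueError(f"no date class is defined for the year {year}")
    # bisect_left gives the index of the first bound >= year;
    # index 0 corresponds to class 2.
    return bisect.bisect_left(UPPER_BOUNDS, year) + 2


print(date_class(1250), date_class(1426))  # 4 9
```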