The procedures to get and process the datasets are described here.
This is the folder structure you need to handle the data used in the CoMix repository. The data is stored in the following sub-folders:

- `data/datasets`: the downloaded raw data (from other sources) used in the CoMix repository.
- `data/datasets.unify`: the unified data, after all the raw images and splits have been processed (this folder is already present).
- `data/comix.coco`: the data converted to COCO format, used to evaluate the models.
- `predicts.coco`: predictions of the detection models.
- `predicts.caps`: predictions of the captioning models.
The eBDtheque dataset can be downloaded from the website after registration. Once you have downloaded the dataset, place ONLY the Pages folder into the `datasets/eBDtheque` folder.
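The placement step can be sketched in shell; a throwaway directory stands in for the real filesystem, and `eBDtheque_extracted` is a hypothetical name for the unzipped download:

```shell
# Throwaway directory standing in for your filesystem; eBDtheque_extracted
# is a hypothetical name for the unzipped download.
root=$(mktemp -d)
mkdir -p "$root/eBDtheque_extracted/Pages"
mkdir -p "$root/datasets/eBDtheque"

# Only the Pages folder goes into datasets/eBDtheque.
mv "$root/eBDtheque_extracted/Pages" "$root/datasets/eBDtheque/"
```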
To convert the images of eBDtheque to the unified format, run the following command:
```shell
$ python comix/process/ebdtheque.py
```

Check the `comix/process/ebdtheque.py` file for the arguments if you want to change the default values.
According to the license of Manga109, the redistribution of the images of Manga109 is not permitted. Thus, you should download the images of Manga109 via the Manga109 webpage.
After downloading, unzip Manga109.zip into the folder `datasets`. Move all the contents of `Manga109_released_x` to the parent folder, then delete the now-empty `Manga109_released_x` directory.
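The flattening step can be sketched as follows; a throwaway directory stands in for `data/datasets`, and the release folder name (`Manga109_released_x`) depends on the version you downloaded:

```shell
# Throwaway directory standing in for data/datasets; the release folder
# name varies by version, so Manga109_released_x is a placeholder.
root=$(mktemp -d)
mkdir -p "$root/Manga109/Manga109_released_x/images"
touch "$root/Manga109/Manga109_released_x/books.txt"

cd "$root/Manga109"
mv Manga109_released_x/* .   # move all contents up one level
rmdir Manga109_released_x    # the directory is now empty
```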
Remove unused files:

```shell
cd data/datasets/Manga109
rm -rf annotations.v20*
rm -rf annotations
```

The folder structure should look like this:
```
datasets/
└── Manga109
    ├── images
    ├── books.txt
    └── readme.txt
```

To convert the images of Manga109 to the unified format, run the following command:
```shell
$ python comix/process/manga109.py
```

which has the following arguments:

- `--input-path`: path to the Manga109 folder (default: `data/datasets/Manga109`)
- `--output-path`: path to the output folder (default: `data/datasets.unify/Manga109`)
- `--override`: override the existing images; annotations are always overwritten (default: `False`)
- `--limit`: stop after the first `{limit}` books (default: `None`)
After downloading the dataset from here, unzip DCM_dataset_public_images.zip into the folder `datasets`. Rename the extracted directory to `DCM` and delete the zip file.
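The rename can be sketched as follows; a throwaway directory stands in for `datasets` (deleting the zip file is omitted here):

```shell
# Throwaway directory standing in for datasets; the folder name matches
# the zip's top-level directory.
root=$(mktemp -d)
mkdir -p "$root/DCM_dataset_public_images"

cd "$root"
mv DCM_dataset_public_images DCM   # rename the extracted directory
```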
The DCM dataset needs to be preprocessed before being converted into the unified format. To preprocess the DCM dataset (jpg renaming) and then convert images to the unified format, run the following command:
```shell
$ python comix/process/dcm.py
```

In DCM, the original enumeration of images starts from '001' rather than '000'; we decided to keep this numbering.
Download the original page images. Unzip raw_pages_images.tar.gz into the folder `datasets` and rename the extracted folder to `books`. Then, move this folder into a newly created directory named `comics`.
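These moves can be sketched as follows; a throwaway directory stands in for `datasets`, and `raw_pages_images` is a hypothetical name for the folder extracted from the tarball:

```shell
# Throwaway directory standing in for datasets; raw_pages_images is a
# hypothetical name for the folder extracted from the tarball.
root=$(mktemp -d)
mkdir -p "$root/raw_pages_images"

cd "$root"
mv raw_pages_images books   # rename the extracted folder
mkdir comics
mv books comics/            # nest it under the new comics directory
```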
The folder hierarchy should look like this:
```
datasets/
└── comics
    └── books
```

The Comics dataset needs to be preprocessed before being converted into the unified format. To preprocess the Comics dataset (jpg renaming) and then convert the images to the unified format, run the following command:
```shell
$ python comix/process/comics.py
```

In the Comics dataset, some images are not viewable (usually the first/last ones). We renamed them anyway.
Check the comix/process/comics.py file for the arguments, if you want to change the default values.
To download the dataset, please refer to the magi repository. After downloading, move the Popmanga folder into `data/datasets` and rename it to `popmanga`. Then, inside the folder, delete the annotations.
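The placement can be sketched as follows; a throwaway directory stands in for the repository root, and the annotations are assumed here to be a sub-folder of the download:

```shell
# Throwaway directory standing in for the repository root; annotations
# is assumed to be a sub-folder of the download.
root=$(mktemp -d)
mkdir -p "$root/Popmanga/annotations" "$root/data/datasets"

cd "$root"
mv Popmanga data/datasets/popmanga         # move and rename in one step
rm -rf data/datasets/popmanga/annotations  # drop the annotations
```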
Now, you can convert the PopManga images to the unified format by running the following command:
```shell
$ python comix/process/popmanga.py
```

In the path `data/datasets.unify/name_of_the_dataset/splits`, the splits are available for every dataset except Manga109: `val.csv` and `test.csv` are provided for each of them, and comics additionally has `train.csv`.