We provide a Lightning-based framework to train a detector on AI-GenBench.
The proposed AI-GenBench benchmark requires a detector to be trained on sliding windows of 4 generators, ordered chronologically. For more info, please refer to our paper AI-GenBench: A New Ongoing Benchmark for AI-Generated Image Detection. This framework can also be used to train on the dataset without following the benchmark protocol.
- Clone the repository.
- Follow the guide to install the training/evaluation requirements (see the README in the root of the repository) in a Python environment.
- Follow the Training section below
You can either follow the benchmark protocol or train a model directly on the whole dataset. The following sections describe both options.
- Create a `local_config.yaml` file. This file will contain the local configuration parameters, such as the dataset path (but you can also set other configuration values). The file should look like this:

      trainer:
        precision: "bf16-mixed"
      dataset_path: "/folder/subfolder/.../ai_gen_bench_v1.0.0"

- (Optional) Adapt the parameters for one of the already provided models. The smallest model we evaluated in our paper is OpenAI ResNet50 CLIP, whose configuration is in `RN50_clip_tune_resize.yaml`. The configuration files for other models are similar.
- Run the training script. We recommend using `run_training.sh`, but you may also do it manually:

      python lightning_main.py fit \
        --config training_configurations/benchmark_pipelines/base_benchmark_sliding_windows.yaml \
        --config training_configurations/RN50_clip/RN50_clip_tune_resize.yaml \
        --config local_config.yaml \
        --experiment_info.experiment_id <experiment_id>

  Just make sure that the `benchmark_pipelines` config is before the model config.
- Logs will be stored in the `experiments_logs` folder, while predictions and checkpoints will be stored in the `experiments_data` folder.
  - By default, predictions will be saved at the end of the "validate" and "test" phases as a `.npz` file, which can be easily loaded using NumPy.
  - The framework will create multiple folders in `experiments_data` to store the results of each sliding window. Each will have its own checkpoint and predictions file.
  - When using W&B Logger, each experiment will have its own "group" set, so you can easily group the sliding windows runs in the online dashboard.
  - When using TensorBoardLogger, logs will be stored in `experiments_logs`. For each experiment, each window will have its own `window_N` folder. Hint: you can aggregate all tfevent files in a single folder to visualize them as a single experiment :).
  - It is recommended you set the `--experiment_info.experiment_id` parameter to a meaningful value from the command line, so that you can easily identify the experiment in the logs. By default, an incremental id will be set (it's the default LightningCLI behavior).
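As a minimal sketch of how a `.npz` predictions file can be inspected with NumPy: the example below first writes a small stand-in file, because the array names stored by the framework are not listed here. The keys `predictions` and `labels` are assumptions made for illustration only; on a real file, check `data.files` to see which arrays are actually present.

```python
import os
import tempfile

import numpy as np

# Write a tiny stand-in predictions file so the example is self-contained.
# The key names below are hypothetical; a real file may use different ones.
tmp_dir = tempfile.mkdtemp()
npz_path = os.path.join(tmp_dir, "predictions.npz")
np.savez(
    npz_path,
    predictions=np.array([0.1, 0.9, 0.7]),
    labels=np.array([0, 1, 1]),
)

# Loading works the same way for a file produced by a training run:
# `files` lists the stored array names, indexing returns each array.
with np.load(npz_path) as data:
    array_names = sorted(data.files)
    arrays = {name: data[name] for name in array_names}

for name, array in arrays.items():
    print(name, array.shape, array.dtype)
```

Listing the stored names before indexing avoids guessing at the file layout.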
If you just want to train a detector on the whole dataset without following the proposed benchmark protocol, you can do so by following the steps detailed in Benchmark protocol with the following differences:
- Set the dataset path in `base_benchmark_full_training.yaml`.
- Run the training script as described above.
There are situations in which you may want to pause and later resume a training run. The framework should already take care of SLURM preemption and re-queueing if enabled (the experiment id is automatically taken from the job id), and resuming an experiment stopped by an external signal (CTRL-C) should also work.
In the latter case, you can resume training by running the same command you used to start it, adding the experiment id:

    python lightning_main.py fit \
      --config training_configurations/benchmark_pipelines/base_benchmark_sliding_windows.yaml \
      --config training_configurations/RN50_clip/RN50_clip_tune_resize.yaml \
      --config local_config.yaml \
      --experiment_info.experiment_id <experiment_id>

You can customize the training procedure mainly by modifying the model configuration files such as `RN50_clip_tune_resize.yaml`. These are the most relevant configuration values to consider:
- `model.model_name`: must be set to a value recognized by the model factory (see below for more details). All models come in the `_probe` and `_tune` variants. The `_probe` variant trains only the last layer of the model, while the `_tune` variant trains the full model.
- `optimizer`: you can set any optimizer you want. Set the `class_path` to the optimizer you want to use and its hyperparameters in `init_args`.
- `scheduler`: you will need to implement schedulers manually in the code. We have already implemented `OneCycleLR`, which is configured as a string and actually implemented in the `configure_optimizers` method of the model.
- `model_input_size`: defaults to `224`, as most models work with 224x224 images, but you can customize the input size here.
- `classification_threshold`: only affects some metrics (accuracy, precision, recall). AUROC is not affected. It doesn't affect the training process.
- `training_cropping_strategy` / `evaluation_cropping_strategy`: you can set any cropping strategy you want. In our paper we found that the `resize` strategy generally works better. You will find configuration files for both `resize` and `crop` in the `training_configurations` folder. Valid values are:
  - training_cropping_strategy: `resize`, `random_crop`, `center_crop`, `as_is`
  - evaluation_cropping_strategy: `resize`, `crop` (central), `multicrop` (implemented as FiveCrop by default, can be customized in the codebase), `as_is`
- Keep an eye on `trainer.accumulate_grad_batches` and `data.train_batch_size`: these are the parameters that determine the batch size. The `trainer.accumulate_grad_batches` parameter sets the number of batches to accumulate before performing an update pass. This is useful when you want to simulate a larger batch size than what your GPU can handle. The `data.train_batch_size` parameter sets the batch size for the training data loader. The overall batch size is the product of those two parameters. Also, if you are using multiple GPUs, the overall batch size will be multiplied by the number of GPUs. So, if you are using 4 GPUs and set `trainer.accumulate_grad_batches` to 2 and `data.train_batch_size` to 32, the overall batch size will be 4 * 2 * 32 = 256. We ran all paper experiments with an overall batch size of 512.
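The batch-size arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch: the helper names `overall_batch_size` and `required_accumulation` are hypothetical and are not part of the framework.

```python
def overall_batch_size(num_gpus: int, accumulate_grad_batches: int, train_batch_size: int) -> int:
    """Number of samples contributing to a single optimizer update."""
    return num_gpus * accumulate_grad_batches * train_batch_size


def required_accumulation(target_batch_size: int, num_gpus: int, train_batch_size: int) -> int:
    """Accumulation steps needed to reach a target overall batch size."""
    samples_per_step = num_gpus * train_batch_size
    if target_batch_size % samples_per_step != 0:
        raise ValueError(
            f"{target_batch_size} is not a multiple of {samples_per_step} "
            "(num_gpus * train_batch_size)"
        )
    return target_batch_size // samples_per_step


# The example from the text: 4 GPUs, 2 accumulation steps, 32 samples per loader batch.
print(overall_batch_size(4, 2, 32))       # 256
# Reaching the overall batch size of 512 with the same GPUs and loader batch size.
print(required_accumulation(512, 4, 32))  # 4
```

Raising an error on a non-divisible target makes a mismatched configuration visible instead of silently rounding the batch size.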
IMPORTANT: here we refer to the process of registering a model architecture such as a new ResNet, ViT, EfficientNet, etcetera, in the model factory. This is not the same as customizing the provided LightningModule BaseDeepfakeDetectionModel. That advanced step is discussed later.
You don't need to implement a new LightningModule class to add a new model (LightningModule is the superclass used in Lightning to define the training and evaluation logic). Instead, you can follow these steps (you can check dinov2.py for a general template):
- In `algorithms/models`, create a new file called `your_model.py`.
- In `your_model.py`, implement the model factory, such as:

      def make_dinov2_model(model_name: str, pretrained: bool = True, **kwargs):
          ...

  The factory should return a PyTorch model for recognized model names, and either return None or raise an error if the model name is not recognized.
- Register your model factory (the first parameter is just a name you want to use to refer to the factory):

      from algorithms.model_factory_registry import ModelFactoryRegistry

      ModelFactoryRegistry().register_model_factory("dinov2", make_dinov2_model)

- Import your script somewhere in the codebase. For example, in the `__init__.py` file of the `algorithms/models` folder. This will ensure that your model factory is registered when the package is imported.
- That's it! From now on you can use your model by setting the `model_name` in the config files.
- (Advanced) For all the pre-implemented models (RN50_CLIP, DINOv2, OpenViT-L-14) we created two versions: `probe` and `tune`. In `probe`, we "hid" the backbone so that it's not listed in the sub-modules and parameters (and thus is not accidentally taken into consideration during training). You might not need to implement a probe-only version of your model, but if you do, you can follow the general schema found in dinov2.py.
  - Setting `requires_grad` to `False` is not enough to freeze a part of the model: you'll also need to call `eval()` on the modules to be frozen, otherwise the statistics of normalization layers will change (they are not frozen by setting `requires_grad=False`). Also, for frozen parts of the model, it is better to forward inputs under `torch.no_grad()` or `torch.inference_mode()` to avoid unnecessary memory usage and computation. This is not strictly necessary, but it's good practice.
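The freezing advice above can be sketched in plain PyTorch as follows. This is an illustration under stated assumptions, not the implementation found in dinov2.py: the class name `FrozenBackboneProbe` and its `feature_dim` argument are hypothetical, and keeping the backbone in a plain Python list is just one possible way to "hide" it from the registered sub-modules.

```python
import torch
from torch import nn


class FrozenBackboneProbe(nn.Module):
    """Trainable linear head on top of a frozen, hidden backbone (sketch)."""

    def __init__(self, backbone: nn.Module, feature_dim: int, num_classes: int = 1):
        super().__init__()
        # A plain list is not tracked by nn.Module, so the backbone does not
        # show up in parameters() or modules() and train() never reaches it.
        self._hidden_backbone = [backbone]
        for param in backbone.parameters():
            param.requires_grad = False
        # eval() freezes normalization statistics and disables dropout.
        backbone.eval()
        self.head = nn.Linear(feature_dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # No autograd graph is built for the frozen part.
        with torch.no_grad():
            features = self._hidden_backbone[0](x)
        return self.head(features)


# Toy backbone with a normalization layer, used only for demonstration.
backbone = nn.Sequential(nn.Linear(8, 4), nn.BatchNorm1d(4))
probe = FrozenBackboneProbe(backbone, feature_dim=4)
probe.train()  # only the head switches to training mode

running_mean_before = backbone[1].running_mean.clone()
output = probe(torch.randn(16, 8))
print(output.shape, len(list(probe.parameters())))
```

One consequence of hiding the backbone this way is that calls such as `probe.to(device)` no longer move it, so it has to be moved to the target device explicitly.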
The following models are supported out of the box:
- DINOv2 models, either `_probe` or `_tune`. Example: `dinov2_vits14_tune`.
- OpenAI CLIP models, either `_probe` or `_tune`. Example: `RN50_tune`.
- A subset of OpenCLIP models, either `_probe` or `_tune`. See openclip_models.py for the supported ones. Example: `clipL14commonpool_tune`.
- timm models, either `_probe` or `_tune`. Example: `convnext_xxlarge.clip_laion2b_soup_ft_in1k_tune`.
- torchvision models, either `_probe` or `_tune`. Example: `resnet50_tune`.
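The `_probe` / `_tune` naming convention can be handled inside a custom model factory with a small helper like the one below. This is a hedged sketch: `split_variant` is a hypothetical function and the actual factories in the repository may parse model names differently.

```python
from typing import Optional, Tuple

# Suffixes used by the model names listed above.
VARIANT_SUFFIXES = ("_probe", "_tune")


def split_variant(model_name: str) -> Optional[Tuple[str, str]]:
    """Split 'resnet50_tune' into ('resnet50', 'tune').

    Returns None when the name carries no recognized variant suffix, which
    mirrors the "return None if not recognized" behavior expected of a factory.
    """
    for suffix in VARIANT_SUFFIXES:
        if model_name.endswith(suffix) and len(model_name) > len(suffix):
            return model_name[: -len(suffix)], suffix[1:]
    return None


print(split_variant("dinov2_vits14_tune"))
print(split_variant("convnext_xxlarge.clip_laion2b_soup_ft_in1k_probe"))
print(split_variant("resnet50"))
```

Matching on the suffix only keeps the helper usable for base names that themselves contain underscores or dots, such as the timm example.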
This is an advanced step. You can customize the training and evaluation logic in the BaseDeepfakeDetectionModel by either modifying, subclassing, or even just using it as a template. The class is already documented in-code to guide you through the process.
The main things you may want to change are:
- `training_step()`: the default implementation already takes care of computing the loss and updating metrics. The only strict requirement is that it must return the loss.
- `evaluation_step()`: the default implementation already takes care of computing the loss, fusing the scores if using multicrop evaluation, and updating metrics. Note: `evaluation_step` is a unified method for both validation and test. The only strict requirement is that it should return (predictions, labels, generator_ids, image_identifiers, losses).
- `scores_fusion()`: only meaningful when running a multicrop evaluation. The default implementation computes the mean score over the crops.
- `configure_optimizers()`: the default implementation already takes care of setting the optimizer and the OneCycleLR scheduler.
  - The `optimizer_factory` is a factory (created by Lightning from the YAML configuration) that accepts the parameter groups and returns the optimizer.
  - You will need to manually implement the scheduler. Follow the implementation of OneCycleLR as a general template. Note that implementing a scheduler in Lightning is not entirely straightforward, so you may need to experiment a bit.
  - Consider customizing `lr_scheduler_step()` only if you really need a custom scheduler step function.
- `register_metrics()`: you can override this method to add new metrics. Just remember to call `super().register_metrics()` to configure the default metrics! Check the `register_metrics` implementation in the superclass to see how to add new metrics. You can use any metric from the `torchmetrics` library, but you can also implement your own metrics.
  - We are planning to make this process more intuitive in the future.
- `train_augmentation()`: must return 1 or 2 augmentations (callables):
  - The first one will be the deterministic part of the pipeline (actually, it will be deterministic only if `deterministic_augmentations` is `true` in the configuration).
  - The second, optional, one is the stochastic part of the pipeline. If you want to follow the benchmark protocol, only a limited set of augmentations is allowed in this part: flip, rotation, crop.
- `val/test/predict_augmentation()`: must return 2 augmentations (callables). Both will be executed in a deterministic context if `deterministic_augmentations` is `true` in the configuration. To understand this, consider that this is what the augmentation pipeline looks like:

      first_augmentation, second_augmentation = model.val_augmentation()

      # mandatory augmentations are defined in the benchmark pipeline class
      img = pipeline.mandatory_eval_processing(img)  # 1
      img = first_augmentation(img)  # 2

      # make_val_crops is a method of the model, explained below
      crops = model.make_val_crops(img)
      result_crops = []
      for crop in crops:
          result_crops.append(second_augmentation(crop))  # 3
      return result_crops

  - (1) First, the mandatory augmentations (defined by the benchmark pipeline) are executed (see below).
  - (2) The first augmentation is applied before the "crop/multicrop/resize" part of the pipeline.
  - (3) The second augmentation is applied separately to each crop.
- `make_eval_crops()`: this method is used to create the crops (for evaluation only). The default implementation already takes care of resizing, cropping, or multi-cropping the image based on the `evaluation_cropping_strategy` configuration value. You can override it if you want to change the cropping strategy.
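To make the multicrop score fusion and the role of `classification_threshold` concrete, here is a minimal standalone sketch. The function names, the list-of-floats input, and the default threshold of 0.5 are assumptions for illustration; the real `scores_fusion()` method operates inside the LightningModule and its exact signature is not reproduced here.

```python
from statistics import fmean
from typing import Sequence


def fuse_crop_scores(crop_scores: Sequence[float]) -> float:
    """Mean fusion: one score per crop in, one score per image out."""
    if not crop_scores:
        raise ValueError("at least one crop score is required")
    return fmean(crop_scores)


def classify(score: float, threshold: float = 0.5) -> int:
    """Binarize a fused score; only threshold-based metrics depend on this."""
    return int(score >= threshold)


# Five scores, as a FiveCrop-style multicrop evaluation would produce.
five_crop_scores = [0.9, 0.8, 0.4, 0.7, 0.7]
fused_score = fuse_crop_scores(five_crop_scores)

print(round(fused_score, 3))
print(classify(fused_score, threshold=0.5), classify(fused_score, threshold=0.8))
```

Changing the threshold flips the predicted class while the fused score stays the same, which is why accuracy, precision, and recall depend on it and AUROC does not.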
That's it! Just remember that, in ai_genbench_pipeline.py, part of the evaluation augmentations is mandatory (only if you want to follow the benchmark protocol, of course). The mandatory augmentations are designed to create images that could realistically be considered "real" and thus used in social media posts, newspapers, and so on, while still adding distortion, compression, etcetera. You can find the mandatory augmentations in the mandatory_val_preprocessing() method of the AIGenBenchPipeline class. You can still add your own augmentations in the model's val/test/predict_augmentation() methods; they will be executed after the mandatory ones (the mandatory ones mimic a situation in which the detection system receives images that were already distorted by that pipeline).
- Our paper, AI-GenBench: A New Ongoing Benchmark for AI-Generated Image Detection, is available on:
  - IJCNN 2025 proceedings (Verimedia workshop) (to be published)
  - arXiv
- For an up-to-date leaderboard of the benchmark, please refer to the README in the root of the repository.