Skip to content

Latest commit

 

History

History
221 lines (150 loc) · 8.16 KB

File metadata and controls

221 lines (150 loc) · 8.16 KB

sdmdl — Species Distribution Modelling with Deep Learning

sdmdl is an object-oriented Python package for species distribution modelling (SDM) using deep neural networks (DNNs). It provides a high-level interface for modelling species' environmental preferences across many abiotic and biotic variables, training binary classification DNNs with dropout regularisation, and generating global distribution predictions.

The package was built to maximise ease of use while still offering in-depth parameter control. The main entry point is a single sdmdl class with four methods that cover the complete workflow:

from sdmdl.sdmdl_main import sdmdl

model = sdmdl('/path/to/repository_root')
model.prep()       # data preparation
model.train()      # model training
model.predict()    # distribution prediction
model.clean()      # remove temporary files

Further customisation is available through a config.yml file that controls model hyper-parameters (see Configuration).

Installation

Prerequisites

GDAL must be installed as a system dependency, including the GDAL Python bindings. On Ubuntu/Debian:

sudo add-apt-repository ppa:ubuntugis/ubuntugis-unstable
sudo apt-get update
sudo apt-get install gdal-bin libgdal-dev
pip install GDAL==$(gdal-config --version)

Install sdmdl

Clone the repository and install:

git clone https://github.com/naturalis/sdmdl.git
cd sdmdl
pip install .

The Python dependencies listed in requirements.txt will be installed automatically.

Input requirements

To create an sdmdl object and subsequently train models, the following inputs are required:

  1. Environmental raster layers (.tif) placed in the appropriate directories:

    • data/gis/layers/scaled/ — layers that need to be standardised during preprocessing.
    • data/gis/layers/non-scaled/ — layers that are already normalised or categorical (e.g. 0 = absent, 1 = present).

    Example datasets are available on Zenodo: environmental rasters.

    Note

    All environmental layers must share the same affine transformation and resolution. This includes the bundled empty_land_map.tif in data/gis/layers/. If you supply your own rasters, ensure they match the affine transformation and resolution of empty_land_map.tif (or vice versa).

  2. Occurrence tables (.csv, .xls, or .xlsx) placed in data/occurrences/. Each table must contain two required columns:

    • decimalLatitude (or decimallatitude) — latitude for each occurrence.
    • decimalLongitude (or decimallongitude) — longitude for each occurrence.

    Coordinates must be in the WGS 84 coordinate system. Example datasets are available on Zenodo: occurrence datasets.

    Warning

    Occurrence coordinates are not validated before data preparation. Incorrect data types (non-numerical values) or coordinates outside the spatial extent of the raster files will cause errors.

Directory layout summary:

  • Scaled .tif layers → data/gis/layers/scaled/
  • Non-scaled .tif layers → data/gis/layers/non-scaled/
  • Occurrence tables → data/occurrences/

Configuration

A config.yml file is generated automatically on first use in the data/ directory. It stores detected raster files, detected occurrence files, and model parameters:

  • random_seed (int): seed for reproducibility (default: 42).
  • pseudo_freq (int): number of pseudo-absence samples (default: 2000).
  • batchsize (int): training batch size (default: 75).
  • epoch (int): number of training epochs (default: 150).
  • model_layers (list of int): nodes per hidden layer; adding items deepens the network (default: [250, 200, 150, 100]).
  • model_dropout (list of float): dropout rate per hidden layer; 0 = no dropout, 1 = full dropout (default: [0.3, 0.5, 0.3, 0.5]).
  • verbose (bool): if True, display progress bars (default: True).

Data paths and detected files can also be customised in config.yml.

Note

Changes to config.yml are not picked up automatically. A new sdmdl object must be created for changes to take effect.

Usage example

Step 1: Create an sdmdl object:

from sdmdl.sdmdl_main import sdmdl

model = sdmdl('/path/to/repository_root')

Step 2: Prepare data (presence maps, raster stack, pseudo-absences, training and prediction datasets):

model.prep()

Step 3: Train deep neural network models for each species:

model.train()

Step 4: Predict global species distributions:

model.predict()

Step 5: Remove temporary intermediate files:

model.clean()

Outputs

Data preparation (prep)

Several temporary intermediate files are created and used as inputs for training and prediction.

Training (train)

  • Performance metricsresults/_DNN_performance/DNN_eval.txt contains per-species accuracy, loss, AUC, true positive rate, and 95 % confidence intervals.
  • Model files — for each species, a .h5 (weights) and .json (architecture) file is saved under results/<species_name>/.
  • Feature importance — a SHAP-based feature impact plot (.png) per species, saved under results/<species_name>/.

Prediction (predict)

  • Prediction map — a colour-mapped .png visualisation of the predicted distribution per species, saved under results/<species_name>/.
  • Prediction raster — a GeoTIFF (.tif) with the predicted probability of presence (0–1) per species, saved under results/<species_name>/.

Package architecture

The package is organised into the following modules:

  • sdmdl.sdmdl_main — the main sdmdl class orchestrating the full workflow.
  • sdmdl.sdmdl.configConfig class for managing config.yml.
  • sdmdl.sdmdl.occurrencesOccurrences class for managing occurrence data.
  • sdmdl.sdmdl.gisGIS class for managing raster layer paths and output locations.
  • sdmdl.sdmdl.trainerTrainer class for DNN training and evaluation.
  • sdmdl.sdmdl.predictorPredictor class for generating distribution predictions.
  • sdmdl.sdmdl.data_prep — data preparation sub-package:
    • presence_map — creates per-species presence rasters.
    • raster_stack — stacks environmental layers into a single GeoTIFF.
    • presence_pseudo_absence — samples pseudo-absence locations.
    • band_statistics — computes band-wise mean and standard deviation.
    • training_data — prepares per-species training datasets.
    • prediction_data — prepares the global prediction dataset.

References

This package implements the deep-learning SDM approach described in:

Rademaker, M., Hogeweg, L., & Vos, R. (2019). Modelling the niches of wild and domesticated Ungulate species using deep learning. bioRxiv, 744441. doi:10.1101/744441

Related repositories:

License

This project is licensed under the MIT License. Copyright © 2019 Naturalis Biodiversity Center.