|
| 1 | +# PID Reports |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +This folder contains scripts that generate reports on Persistent Identifier (PID) usage in a Dataverse instance. These scripts specifically identify cases where PIDs were not found, which may indicate: |
| 6 | + |
| 7 | +- In-the-wild use of draft PIDs |
| 8 | +- Posting of PIDs with typos |
| 9 | +- PIDs with extra characters (e.g., trailing periods) |
| 10 | +- Other malformed PID references |
| 11 | + |
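As an illustration of the trailing-character case above, the following is a minimal sketch of how a reported PID could be turned into a candidate correction. The function name `candidate_pid` and the set of stripped characters are assumptions for illustration only; nothing here is taken from the scripts in this folder.

```python
def candidate_pid(pid):
    """Hypothetical helper: strip trailing punctuation and whitespace
    that commonly gets attached to a PID copied out of running text."""
    return pid.rstrip(".,;: \t\n")


# A reported PID with a trailing period and its candidate correction.
print(candidate_pid("doi:10.5072/FK2/ABC123."))
```

A candidate produced this way would still need to be checked against the Dataverse instance before assuming it is the intended PID.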
| 12 | +## Scripts |
| 13 | + |
| 14 | +The main script in this folder is `dcpidreport.py`, which checks DataCite for DOI resolution failures and generates reports on them. Any DOI reported by this script is one that someone tried to resolve, i.e., via https://doi.org/*. DataCite can sometimes be more than a month late in updating its reports; the script is able to handle this delay. |
| 15 | + |
| 16 | +A second script, `pidreport.py`, performs similar functions for any type of PID. However, it relies on [functionality to create an initial PIDFailures report](https://github.com/IQSS/dataverse/pull/11601) that is not yet merged into the standard Dataverse distribution from https://github.com/IQSS/dataverse. |
| 17 | +The benefits of this report are that the results cover any kind of PID, are available every month, and capture any call to Dataverse that requires a PID (e.g., where someone has posted a direct but incorrect link to a dataset page), whereas DataCite only reports DOI resolution failures. |
| 18 | + |
| 19 | +## Purpose |
| 20 | + |
| 21 | +These reports help maintain the integrity of your Dataverse's persistent identifier system by: |
| 22 | +- Identifying problematic PID references |
| 23 | +- Alerting administrators to potential issues |
| 24 | +- Providing data for troubleshooting and correction |
| 25 | + |
| 26 | +## Usage |
| 27 | + |
| 28 | +The scripts are designed to be run periodically (typically monthly) via a cron job. |
| 29 | + |
| 30 | +### Configuration |
| 31 | + |
| 32 | +Before using these scripts, you need to configure several variables in each script: |
| 33 | + |
| 34 | +#### For dcpidreport.py: |
| 35 | + |
| 36 | +1. **File paths**: |
| 37 | + - Update the `filename` variable to point to your desired state file location |
| 38 | + |
| 39 | +2. **Dataverse configuration**: |
| 40 | + - `doi_account`: Your DataCite account prefix (e.g., "GDCC.YOUR_ACCOUNT") |
| 41 | + - `dataverse_base_url`: The base URL of your Dataverse installation (e.g., "https://data.yourdataverse.org") |
| 42 | + |
| 43 | +3. **Email configuration**: |
| 44 | + - `receivers`: Email addresses that should receive the reports |
| 45 | + - `smtp_server`: Your SMTP server address |
| 46 | + - `port`: SMTP port (default is 465 for SSL) |
| 47 | + - `sender_email`: Email address from which reports will be sent |
| 48 | + - `username`: SMTP authentication username |
| 49 | + - `password`: SMTP authentication password |
| 50 | + |
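The settings listed above could look roughly like the following. This is a sketch under stated assumptions: the variable names come from the list, but every value is a placeholder, the state file path is invented, and the helpers `build_message` and `send_report` are hypothetical and not taken from `dcpidreport.py`.

```python
import smtplib
import ssl
from email.message import EmailMessage

# Placeholder values; only the variable names come from the list above.
filename = "/opt/pidreporting/dcpidreport.state"  # assumed state file path
doi_account = "GDCC.YOUR_ACCOUNT"
dataverse_base_url = "https://data.yourdataverse.org"
receivers = ["admin@example.org"]
smtp_server = "smtp.example.org"
port = 465  # implicit SSL
sender_email = "pidreports@example.org"
username = "smtp-user"
password = "change-me"


def build_message(subject, body):
    """Hypothetical helper: wrap a report body in an email message."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = sender_email
    msg["To"] = ", ".join(receivers)
    msg.set_content(body)
    return msg


def send_report(msg):
    """Hypothetical helper: send over an SSL connection, matching port 465."""
    context = ssl.create_default_context()
    with smtplib.SMTP_SSL(smtp_server, port, context=context) as server:
        server.login(username, password)
        server.send_message(msg)
```

If your SMTP server expects STARTTLS on port 587 instead, `smtplib.SMTP` followed by `starttls()` would be the usual alternative to `SMTP_SSL`.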
| 51 | +#### For pidreport.py: |
| 52 | + |
| 53 | +1. **File paths**: |
| 54 | + - `log_dir`: Directory where PID failure logs are stored |
| 55 | + |
| 56 | +2. **Dataverse configuration**: |
| 57 | + - `dataverse_base_url`: The base URL of your Dataverse installation |
| 58 | + |
| 59 | +3. **Email configuration**: |
| 60 | + - Same as dcpidreport.py (receivers, smtp_server, port, sender_email, username, password) |
| 61 | + |
| 62 | +4. **IP blacklist configuration**: |
| 63 | + - `blacklist`: List of IP addresses to exclude from reports (e.g., known scanners such as UT Dorkbot / autoscan.infosec.utexas.edu that test with incorrect PIDs) |
| 64 | + |
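The effect of the `blacklist` setting can be sketched as follows. The record layout (an `(ip, pid)` tuple), the function name `filter_failures`, and the sample addresses are assumptions for illustration; the actual log format read by `pidreport.py` may differ.

```python
# Assumed example values (documentation-only IP ranges).
blacklist = ["192.0.2.10", "198.51.100.7"]


def filter_failures(failures, blocked):
    """Drop failure records whose source IP is in the blocked list.

    `failures` is assumed to be an iterable of (ip, pid) tuples.
    """
    blocked_set = set(blocked)
    return [(ip, pid) for ip, pid in failures if ip not in blocked_set]


failures = [
    ("192.0.2.10", "doi:10.5072/FK2/ABC123."),  # from a known scanner
    ("203.0.113.5", "doi:10.5072/FK2/XYZ789"),  # from an ordinary client
]
print(filter_failures(failures, blacklist))
```

Only the record from the address that is not listed remains, so scanner traffic does not inflate the report.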
| 65 | +### Cron Configuration |
| 66 | + |
| 67 | +To set up a monthly cron job that runs on the 1st day of each month, add something similar to the following to your crontab: |
| 68 | + |
| 69 | +    10 5 1 * * /usr/bin/python3 /opt/pidreporting/pidreport.py >> /var/log/pidreport.log 2>&1 |
| 70 | +    12 5 1 * * /usr/bin/python3 /opt/pidreporting/dcpidreport.py >> /var/log/dcpidreport.log 2>&1 |
| 71 | + |