Commit 7cc13c1

Merge pull request #26 from gdcc/pidchecks
Scripts to detect use of draft PIDs and other not found errors
2 parents 29b1fd8 + 2c9381f commit 7cc13c1

3 files changed

Lines changed: 219 additions & 0 deletions

python/pid_reports/README.md

Lines changed: 71 additions & 0 deletions
@@ -0,0 +1,71 @@
# PID Reports

## Overview

This folder contains scripts that generate reports on Persistent Identifier (PID) usage in a Dataverse instance. These scripts specifically identify cases where PIDs were not found, which may indicate:

- In-the-wild use of draft PIDs
- Posting of PIDs with typos
- PIDs with extra characters (e.g., trailing periods)
- Other malformed PID references

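To illustrate the "extra characters" case: a not-found PID that ends in stray punctuation can often be mapped back to the identifier that was intended. The sketch below is not part of the scripts in this folder. `normalize_doi` is a hypothetical helper, the set of stripped characters is an assumption, and the sample DOI uses a test prefix.

```python
import string

# Hypothetical helper, not part of dcpidreport.py or pidreport.py.
# Assumption: characters picked up from surrounding prose are punctuation
# or whitespace at the end of the identifier.
TRAILING_JUNK = ".,;:)]}>\"'" + string.whitespace


def normalize_doi(pid: str) -> str:
    """Return the PID with trailing punctuation and whitespace removed."""
    return pid.rstrip(TRAILING_JUNK)


print(normalize_doi("doi:10.5072/FK2/ABCDEF."))
```

A suggestion produced this way still needs a manual check, since a valid identifier can itself end in one of these characters.
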
## Scripts

The main script in this folder is `dcpidreport.py`, which downloads DataCite's monthly resolution reports and lists the DOIs under your account that failed to resolve. Every entry it reports means that someone tried to resolve that DOI, i.e. via https://doi.org/*. DataCite can be more than a month late in publishing a report; the script records the last month it processed in a state file and catches up on every report published since then.

A second script, `pidreport.py`, performs a similar function for any kind of PID. However, it relies on [functionality to create an initial PIDFailures report](https://github.com/IQSS/dataverse/pull/11601) that is not yet merged into the standard Dataverse distribution from https://github.com/IQSS/dataverse.
The benefits of this report are that the results cover any kind of PID, are available every month, and capture any call to Dataverse requiring a PID (e.g. where someone has posted a direct, incorrect link to a dataset page), whereas DataCite only reports DOI resolution failures.

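The catch-up behaviour of `dcpidreport.py` can be sketched in isolation. The state file holds the last processed month as `MM_YYYY`, and the months still to fetch run from the following month through the month before the current one. `pending_months` is a hypothetical name used only for this illustration, and the sketch uses the standard library alone rather than `dateutil`. The real script does not compute this list up front; it simply stops at the first month whose report returns an HTTP error.

```python
from datetime import date


def pending_months(last_processed: str, today: date) -> list[str]:
    """List MM_YYYY labels after last_processed, up to the month before today."""
    month, year = (int(part) for part in last_processed.split("_"))
    labels = []
    while True:
        # Advance one month, rolling over into the next year after December.
        month, year = (1, year + 1) if month == 12 else (month + 1, year)
        if (year, month) >= (today.year, today.month):
            break
        labels.append(f"{month:02d}_{year}")
    return labels


print(pending_months("11_2023", date(2024, 3, 5)))
```
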
## Purpose

These reports help maintain the integrity of your Dataverse's persistent identifier system by:
- Identifying problematic PID references
- Alerting administrators to potential issues
- Providing data for troubleshooting and correction

## Usage

The scripts are designed to be run periodically (typically monthly) via a cron job.

### Configuration

Before using these scripts, you need to configure several variables in each script:

#### For dcpidreport.py:

1. **File paths**:
   - Update the `filename` variable to point to your desired state file location

2. **Dataverse configuration**:
   - `doi_account`: Your DataCite account prefix (e.g., "GDCC.YOUR_ACCOUNT")
   - `dataverse_base_url`: The base URL of your Dataverse installation (e.g., "https://data.yourdataverse.org")

3. **Email configuration**:
   - `receivers`: Comma-separated email addresses that should receive the reports
   - `smtp_server`: Your SMTP server address
   - `port`: SMTP port (default is 465 for SSL)
   - `sender_email`: Email address from which reports will be sent
   - `username`: SMTP authentication username
   - `password`: SMTP authentication password

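`dcpidreport.py` ships with empty strings for `smtp_server`, `sender_email`, `username`, and `password`, so an unconfigured run only fails once it tries to connect to the mail server. A small pre-flight check can report the problem earlier. This is a sketch rather than part of the script: `missing_settings` is a hypothetical helper and the sample values are placeholders.

```python
def missing_settings(settings: dict) -> list:
    """Return the names of settings that are empty or whitespace-only."""
    return [name for name, value in settings.items() if not str(value).strip()]


# Placeholder values mirroring the variables listed above.
config = {
    "smtp_server": "",
    "sender_email": "sender@example.com",
    "username": "your_username",
    "password": "",
    "port": 465,
}

missing = missing_settings(config)
if missing:
    print("Set these variables before running: " + ", ".join(missing))
```
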
#### For pidreport.py:

1. **File paths**:
   - `log_dir`: Directory where PID failure logs are stored

2. **Dataverse configuration**:
   - `dataverse_base_url`: The base URL of your Dataverse installation

3. **Email configuration**:
   - Same as dcpidreport.py (receivers, smtp_server, port, sender_email, username, password)

4. **IP blacklist configuration**:
   - `blacklist`: List of IP addresses to exclude from reports (e.g., known scanners such as UT Dorkbot / autoscan.infosec.utexas.edu that test with incorrect PIDs)

### Cron Configuration

To set up a monthly cron job that runs on the 1st day of each month, add something similar to the following to your crontab:

    10 5 1 * * python3 /opt/pidreporting/pidreport.py >> /var/log/pidreport.log 2>&1
    12 5 1 * * /usr/bin/python3 /opt/pidreporting/dcpidreport.py >> /var/log/dcpidreport.log 2>&1


python/pid_reports/dcpidreport.py

Lines changed: 83 additions & 0 deletions
@@ -0,0 +1,83 @@
import shutil
import tempfile
import urllib.request
import gzip
from datetime import datetime
import os
from urllib.error import HTTPError
from dateutil.relativedelta import relativedelta
import ssl, smtplib

currentmonth=datetime.now().replace(day=1) + relativedelta(days=-1)  # last day of the previous month
processmonth=currentmonth

filename="/opt/pidreporting/dcpidreportstate"  # state file: last month processed, as MM_YYYY
if os.path.exists(filename):
    with open(filename) as f:
        line=next(f)
    processmonth = datetime.strptime(line.strip('\n'),"%m_%Y")
    processmonth = processmonth + relativedelta(months=+1)  # resume with the month after the last one processed

# Configuration variables
receivers = "admin@mydataverse.org,support@myinstitution.org" # Enter receiver addresses, comma-separated
doi_account = "GDCC.YOUR_ACCOUNT"  # Replace with your DataCite account prefix
dataverse_base_url = "https://data.yourdataverse.org"  # Replace with your Dataverse installation URL

# mail config
port = 465  # For SSL
smtp_server = "" # Enter your SMTP server address, e.g. email-smtp.us-east-1.amazonaws.com
sender_email = ""  # Enter your address
username = ""
password = ""

message = "Subject: DataCite DOI Resolution Failure Reports\nTo: " + receivers + "\n\n"

found=True
somereports=False
while (processmonth <= currentmonth) and found:  # stop at the first month DataCite has not published yet
    monthstr = processmonth.strftime("%m_%Y")
    try:
        with urllib.request.urlopen('https://stats.datacite.org/stats/resolution-report/resolutions_' + monthstr + '.html') as response:
            with tempfile.NamedTemporaryFile(delete=False) as tmp_file:
                gzip_fd = gzip.GzipFile(fileobj=response)
                shutil.copyfileobj(gzip_fd, tmp_file)
        message = message + "Report for " + monthstr +"\n\nHits\tDOI\tURI\n(Note: clicking links will record new failures unless these are drafts)\n"
        with open(tmp_file.name) as html:
            for line in html:  # all three loops advance the same file iterator
                if doi_account in line:
                    rightlist=False
                    done=False
                    for line in html:
                        if "</ol>" in line:  # end of the account's first list; the failures list follows
                            rightlist=True
                        elif ("<ol>" in line) and rightlist:
                            somereports=True
                            for line in html:
                                if line.startswith("<a "):
                                    failpid = "doi:" + line.split('"')[1].split("doi.org/")[1]
                                    linewithcount=next(html)
                                    linewithcount=next(html)  # the hit count is two lines below the link
                                    message=message + "\n" + linewithcount.replace(')','(').split("(")[1] + "\t" + failpid + "\t" + dataverse_base_url + "/dataset.xhtml?persistentId=" + failpid
                                elif "</ol>" in line:
                                    done=True
                                    break
                        if done:
                            break
                    if done:
                        break
        os.remove(tmp_file.name)  # delete=False above means the temp file must be removed by hand
        with open(filename, "w") as f:
            f.write(processmonth.strftime("%m_%Y"))
        message= message + "\n\n"
        processmonth=processmonth + relativedelta(months=+1)
    except HTTPError as err:
        found=False
        if not somereports:
            message=message + "No new monthly reports from DataCite. Next report expected: " + monthstr + "\n\n"


context = ssl.create_default_context()
with smtplib.SMTP_SSL(smtp_server, port, context=context) as server:
    server.login(username, password)
    server.sendmail(sender_email, receivers.split(","), message)

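The hit count in `dcpidreport.py` is extracted with `linewithcount.replace(')','(').split("(")[1]`, which raises `IndexError` when the line contains no parentheses. A regular expression makes the intent explicit and lets the caller handle a missing count. This is a sketch under assumptions: `hit_count` is a hypothetical helper, and since the exact markup of the DataCite report is not reproduced here, the sample lines are made up.

```python
import re
from typing import Optional

# Matches the first run of digits wrapped in parentheses, e.g. "(12)".
COUNT_IN_PARENS = re.compile(r"\((\d+)\)")


def hit_count(line: str) -> Optional[int]:
    """Return the parenthesised number in line, or None if there is none."""
    match = COUNT_IN_PARENS.search(line)
    return int(match.group(1)) if match else None


# Made-up sample lines, for illustration only.
print(hit_count("</a> (12)"))
print(hit_count("no count on this line"))
```
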

python/pid_reports/pidreport.py

Lines changed: 65 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,65 @@
import smtplib, ssl, datetime, os.path

# Configuration variables
# File paths
log_dir = "/usr/local/payara6/domains/domain1/logs"  # Replace with your log directory path

# Dataverse configuration
dataverse_base_url = "https://data.yourdataverse.org"  # Replace with your Dataverse installation URL

# Email configuration
receivers = "admin@mydataverse.org,support@myinstitution.org"  # Enter receiver addresses
port = 465  # For SSL
smtp_server = "smtp.example.com"  # Replace with your SMTP server
sender_email = "sender@example.com"  # Enter your address
username = "your_username"  # Replace with your SMTP username
password = "your_password"  # Replace with your SMTP password

blacklist=[]
blacklist.append("146.6.15.11") #UT Dorkbot / autoscan.infosec.utexas.edu


def numSort(s):
    return int(s[0:s.index("_")])  # sort key: the hit count before the first "_"
filename = os.path.join(log_dir, "PIDFailures.log")  # ASSUMED file name (undefined in the commit): point this at your PIDFailures log
message = "Subject: PID Failure Report\nTo: " + receivers + "\n\n"  # ASSUMED header (message was undefined in the commit)
if os.path.exists(filename):
    d={}
    blcount=0
    with open(filename) as f:
        for line in f.readlines()[1:]:  # skip the header line
            (pid, uri, method, ip, time)=line.rstrip("\n").split("\t")
            if pid not in d and ip not in blacklist:
                d[pid] = []
            if ip not in blacklist:
                d[pid].append(method + " " + uri + " from " + ip + " at " + time)
            else:
                blcount = blcount + 1

    l=[]
    for key in d:
        l.append(str(len(d[key])) + "_" + key)  # "<count>_<pid>", parsed again by numSort

    l.sort(reverse=True, key=numSort)

    message = message + "Hits\tDOI\tURI\n(Note: clicking links will record new failures unless these are drafts)\n"

    for val in l:
        doi = val[val.index("_")+1:]
        message = message + "\n" + str(numSort(val)) + "\t" + doi + "\t" + dataverse_base_url + "/dataset.xhtml?persistentId=" + doi

    message = message + "\n\nDetails:\n\n"

    if blcount != 0:
        message = message + str(blcount) + " entries (not reported) from blacklisted IP addresses (e.g. UT Dorkbot)\n\n"
    for val in l:
        doi = val[val.index("_")+1:]
        message = message + doi + "\n\t" + "\n\t".join(d[doi]) + "\n"
else:
    message= message + "No Failures this month\n\n"

context = ssl.create_default_context()
with smtplib.SMTP_SSL(smtp_server, port, context=context) as server:
    server.login(username, password)
    server.sendmail(sender_email, receivers.split(","), message)

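`pidreport.py` sorts its results by prefixing each PID with its hit count (`"<count>_<pid>"`) and parsing the number back out in `numSort`. Sorting `(pid, entries)` pairs by the length of each entry list gives the same ordering without the string round trip. The sketch below assumes the log layout the script expects: one header line followed by five tab-separated fields per row. `group_failures` is a hypothetical name, and the header names and sample rows are invented.

```python
import csv
import io
from collections import defaultdict


def group_failures(log_file, blacklist=()):
    """Group log rows by PID, most frequently requested PID first."""
    details = defaultdict(list)
    reader = csv.reader(log_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader, None)  # skip the header line, as pidreport.py does
    for pid, uri, method, ip, time in reader:
        if ip in blacklist:
            continue
        details[pid].append(f"{method} {uri} from {ip} at {time}")
    # sorted() is stable, so PIDs with equal counts keep first-seen order.
    return sorted(details.items(), key=lambda item: len(item[1]), reverse=True)


# Invented sample data; the header names are an assumption.
sample_log = io.StringIO(
    "PID\tURI\tMethod\tIP\tTime\n"
    "doi:10.5072/FK2/AAAAAA\t/dataset.xhtml\tGET\t203.0.113.7\t2024-02-01T10:00:00\n"
    "doi:10.5072/FK2/BBBBBB\t/dataset.xhtml\tGET\t203.0.113.7\t2024-02-02T11:00:00\n"
    "doi:10.5072/FK2/BBBBBB\t/dataset.xhtml\tGET\t203.0.113.9\t2024-02-03T12:00:00\n"
    "doi:10.5072/FK2/AAAAAA\t/dataset.xhtml\tGET\t146.6.15.11\t2024-02-04T13:00:00\n"
)

for pid, entries in group_failures(sample_log, blacklist={"146.6.15.11"}):
    print(len(entries), pid)
```
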
