Commit 8b4100d

Merge pull request #9614 from IQSS/8889-filepids-in-collections
8889 file-level PIDs configuration in individual collections
2 parents 0cad39a + 59cd7ab commit 8b4100d

20 files changed

Lines changed: 398 additions & 17 deletions

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
1+
It is now possible to configure registering PIDs for files in individual collections.
2+
3+
For example, registration of PIDs for files can be enabled in a specific collection when it is disabled instance-wide. Or it can be disabled in specific collections where it is enabled by default. See the [:FilePIDsEnabled](https://guides.dataverse.org/en/latest/installation/config.html#filepidsenabled) section of the Configuration guide for details.

doc/sphinx-guides/source/admin/dataverses-datasets.rst

Lines changed: 21 additions & 4 deletions
@@ -153,15 +153,32 @@ Mint a PID for a File That Does Not Have One
153153
In the following example, the database id of the file is 42::
154154

155155
export FILE_ID=42
156-
curl http://localhost:8080/api/admin/$FILE_ID/registerDataFile
156+
curl "http://localhost:8080/api/admin/$FILE_ID/registerDataFile"
157157

158-
Mint PIDs for Files That Do Not Have Them
159-
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
158+
Mint PIDs for all unregistered published files in the specified collection
159+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
160160

161-
If you have a large number of files, you might want to consider miniting PIDs for files individually using the ``registerDataFile`` endpoint above in a for loop, sleeping between each registration::
161+
The following API will register PIDs for all published files that are not yet registered in the datasets **directly within the collection** specified by its alias::
162+
163+
curl "http://localhost:8080/api/admin/registerDataFiles/{collection_alias}"
164+
165+
It will not attempt to register the datafiles in its sub-collections, so this call will need to be repeated on any sub-collections where files need to be registered as well. File-level PID registration must be enabled on the collection. (Note that it is possible to have it enabled for a specific collection, even when it is disabled for the Dataverse installation as a whole. See :ref:`collection-attributes-api` in the Native API Guide.)
166+
167+
This API will sleep for 1 second between registration calls by default. A longer sleep interval can be specified with an optional ``sleep=`` parameter::
168+
169+
curl "http://localhost:8080/api/admin/registerDataFiles/{collection_alias}?sleep=5"
170+
171+
Mint PIDs for ALL unregistered files in the database
172+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
173+
174+
The following API will attempt to register the PIDs for all the published files in your instance that do not yet have them::
162175

163176
curl http://localhost:8080/api/admin/registerDataFileAll
164177

178+
The application will attempt to sleep for 1 second between registration attempts so as not to overload your persistent identifier service provider. Note that if you have a large number of files that need to be registered in your Dataverse, you may want to consider minting file PIDs within individual collections, or even for individual files using the ``registerDataFiles`` and/or ``registerDataFile`` endpoints above in a loop, with a longer sleep interval between calls.
179+
180+
181+
165182
Mint a New DOI for a Dataset with a Handle
166183
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
167184
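
Because the collection-level call documented above does not descend into sub-collections, a client has to issue one request per collection. The following is a minimal Java sketch, not part of this commit, that only builds the request URLs. The base URL and the collection aliases are assumptions chosen for illustration, and aliases are assumed to be URL-safe.

```java
import java.util.ArrayList;
import java.util.List;

class RegisterDataFilesUrlSketch {

    // Builds the admin API URL for one collection.
    // sleepSeconds may be null, in which case the server-side default applies.
    static String urlFor(String baseUrl, String alias, Integer sleepSeconds) {
        if (sleepSeconds != null && sleepSeconds < 1) {
            throw new IllegalArgumentException("sleep must be at least 1 second");
        }
        String url = baseUrl + "/api/admin/registerDataFiles/" + alias;
        return sleepSeconds == null ? url : url + "?sleep=" + sleepSeconds;
    }

    // One URL per collection: the parent and each sub-collection must be listed explicitly.
    static List<String> urlsFor(String baseUrl, List<String> aliases, Integer sleepSeconds) {
        List<String> urls = new ArrayList<>();
        for (String alias : aliases) {
            urls.add(urlFor(baseUrl, alias, sleepSeconds));
        }
        return urls;
    }

    public static void main(String[] args) {
        // "parent" and "parent-sub1" are hypothetical aliases.
        for (String url : urlsFor("http://localhost:8080", List.of("parent", "parent-sub1"), 5)) {
            System.out.println(url);
        }
    }
}
```

Each resulting URL can then be requested in turn, for example with ``curl`` as in the documentation above.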

doc/sphinx-guides/source/api/native-api.rst

Lines changed: 18 additions & 0 deletions
@@ -738,6 +738,24 @@ The fully expanded example above (without environment variables) looks like this
738738
739739
curl -H X-Dataverse-key:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx https://demo.dataverse.org/api/dataverses/root/guestbookResponses?guestbookId=1 -o myResponses.csv
740740
741+
.. _collection-attributes-api:
742+
743+
Change Collection Attributes
744+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
745+
746+
.. code-block::
747+
748+
curl -X PUT -H "X-Dataverse-key:$API_TOKEN" "$SERVER_URL/api/dataverses/$ID/attribute/$ATTRIBUTE?value=$VALUE"
749+
750+
The following attributes are supported:
751+
752+
* ``alias`` Collection alias
753+
* ``name`` Name
754+
* ``description`` Description
755+
* ``affiliation`` Affiliation
756+
* ``filePIDsEnabled`` ("true" or "false") Enables or disables registration of file-level PIDs in datasets within the collection (overriding the instance-wide setting).
757+
758+
741759
Datasets
742760
--------
743761

doc/sphinx-guides/source/installation/config.rst

Lines changed: 3 additions & 2 deletions
@@ -2766,13 +2766,14 @@ timestamps.
27662766
:FilePIDsEnabled
27672767
++++++++++++++++
27682768

2769-
Toggles publishing of file-based PIDs for the entire installation. By default this setting is absent and Dataverse Software assumes it to be true. If enabled, the registration will be performed asynchronously (in the background) during publishing of a dataset.
2769+
Toggles publishing of file-level PIDs for the entire installation. By default this setting is absent and Dataverse Software assumes it to be true. If enabled, the registration will be performed asynchronously (in the background) during publishing of a dataset.
27702770

27712771
If you don't want to register file-based PIDs for your installation, set:
27722772

27732773
``curl -X PUT -d 'false' http://localhost:8080/api/admin/settings/:FilePIDsEnabled``
27742774

2775-
Note: File-level PID registration was added in Dataverse Software 4.9; it could not be disabled until Dataverse Software 4.9.3.
2775+
2776+
It is possible to override the installation-wide setting for specific collections. For example, registration of PIDs for files can be enabled in a specific collection when it is disabled instance-wide. Or it can be disabled in specific collections where it is enabled by default. See :ref:`collection-attributes-api` for details.
27762777

27772778
.. _:IndependentHandleService:
27782779

src/main/java/edu/harvard/iq/dataverse/DataFileServiceBean.java

Lines changed: 12 additions & 0 deletions
@@ -191,6 +191,18 @@ public List<DataFile> findByDatasetId(Long studyId) {
191191
.setParameter("studyId", studyId).getResultList();
192192
}
193193

194+
/**
195+
*
196+
* @param collectionId numeric id of the parent collection ("dataverse")
197+
* @return list of files in the datasets that are *direct* children of the collection specified
198+
* (i.e., no datafiles in sub-collections of this collection will be included)
199+
*/
200+
public List<DataFile> findByDirectCollectionOwner(Long collectionId) {
201+
String queryString = "select f from DataFile f, Dataset d where f.owner.id = d.id and d.owner.id = :collectionId order by f.id";
202+
return em.createQuery(queryString, DataFile.class)
203+
.setParameter("collectionId", collectionId).getResultList();
204+
}
205+
194206
public List<DataFile> findAllRelatedByRootDatafileId(Long datafileId) {
195207
/*
196208
Get all files with the same root datafile id

src/main/java/edu/harvard/iq/dataverse/Dataverse.java

Lines changed: 27 additions & 1 deletion
@@ -590,8 +590,34 @@ public void setCitationDatasetFieldTypes(List<DatasetFieldType> citationDatasetF
590590
this.citationDatasetFieldTypes = citationDatasetFieldTypes;
591591
}
592592

593-
593+
/**
594+
* @Note: this setting is Nullable, with {@code null} indicating that the
595+
* desired behavior is not explicitly configured for this specific collection.
596+
* See the comment below.
597+
*/
598+
@Column(nullable = true)
599+
private Boolean filePIDsEnabled;
594600

601+
/**
602+
* Specifies whether the PIDs for Datafiles should be registered when publishing
603+
* datasets in this Collection, if the behavior is explicitly configured.
604+
* @return {@code Boolean.TRUE} if explicitly enabled, {@code Boolean.FALSE} if explicitly disabled.
605+
* {@code null} indicates that the behavior is not explicitly defined, in which
606+
* case the behavior should follow the explicit configuration of the first
607+
* direct ancestor collection, or the instance-wide configuration, if none
608+
* present.
609+
* @Note: If present, this configuration therefore by default applies to all
610+
* the sub-collections, unless explicitly overwritten there.
611+
* @author landreev
612+
*/
613+
public Boolean getFilePIDsEnabled() {
614+
return filePIDsEnabled;
615+
}
616+
617+
public void setFilePIDsEnabled(boolean filePIDsEnabled) {
618+
this.filePIDsEnabled = filePIDsEnabled;
619+
}
620+
595621
public List<DataverseFacet> getDataverseFacets() {
596622
return getDataverseFacets(false);
597623
}
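
The Javadoc on ``getFilePIDsEnabled()`` describes a three-state setting: ``TRUE``, ``FALSE``, or ``null`` meaning "inherit from the nearest ancestor that has an explicit value, else use the instance-wide setting". The sketch below illustrates that lookup under stated assumptions. It is not the ``SystemConfig.isFilePIDsEnabledForCollection`` implementation, which this diff excerpt does not show, and the ``Coll`` class is a hypothetical stand-in for ``Dataverse``.

```java
class FilePidInheritanceSketch {

    /** Hypothetical stand-in for a collection, reduced to the fields this sketch needs. */
    static final class Coll {
        final String alias;
        final Coll owner;               // parent collection; null for the root
        final Boolean filePIDsEnabled;  // null means "not explicitly configured here"

        Coll(String alias, Coll owner, Boolean filePIDsEnabled) {
            this.alias = alias;
            this.owner = owner;
            this.filePIDsEnabled = filePIDsEnabled;
        }
    }

    // Walks from the collection up through its ancestors and returns the first
    // explicit value found; falls back to the instance-wide setting otherwise.
    static boolean isFilePIDsEnabledFor(Coll collection, boolean instanceWideDefault) {
        for (Coll c = collection; c != null; c = c.owner) {
            if (c.filePIDsEnabled != null) {
                return c.filePIDsEnabled;
            }
        }
        return instanceWideDefault;
    }

    public static void main(String[] args) {
        Coll root = new Coll("root", null, null);
        Coll lab = new Coll("lab", root, Boolean.TRUE);
        Coll project = new Coll("project", lab, null);

        // File PIDs are disabled instance-wide (false) in this example.
        System.out.println(isFilePIDsEnabledFor(project, false)); // inherits TRUE from "lab"
        System.out.println(isFilePIDsEnabledFor(root, false));    // no explicit value anywhere
    }
}
```

With these hypothetical collections, ``project`` resolves to enabled because its parent ``lab`` sets the value explicitly, while ``root`` falls back to the instance-wide setting.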

src/main/java/edu/harvard/iq/dataverse/api/Admin.java

Lines changed: 89 additions & 1 deletion
@@ -1376,7 +1376,7 @@ public Response fixMissingOriginalTypes() {
13761376
"All the tabular files in the database already have the original types set correctly; exiting.");
13771377
} else {
13781378
for (Long fileid : affectedFileIds) {
1379-
logger.info("found file id: " + fileid);
1379+
logger.fine("found file id: " + fileid);
13801380
}
13811381
info.add("message", "Found " + affectedFileIds.size()
13821382
+ " tabular files with missing original types. Kicking off an async job that will repair the files in the background.");
@@ -1566,6 +1566,12 @@ public Response registerDataFileAll(@Context ContainerRequestContext crc) {
15661566
} catch (Exception e) {
15671567
logger.info("Unexpected Exception: " + e.getMessage());
15681568
}
1569+
1570+
try {
1571+
Thread.sleep(1000);
1572+
} catch (InterruptedException ie) {
1573+
logger.warning("Interrupted Exception when attempting to execute Thread.sleep()!");
1574+
}
15691575
}
15701576
logger.info("Final Results:");
15711577
logger.info(alreadyRegistered + " of " + count + " files were already registered. " + new Date());
@@ -1577,6 +1583,88 @@ public Response registerDataFileAll(@Context ContainerRequestContext crc) {
15771583
return ok("Datafile registration complete." + successes + " of " + released
15781584
+ " unregistered, published files registered successfully.");
15791585
}
1586+
1587+
@GET
1588+
@AuthRequired
1589+
@Path("/registerDataFiles/{alias}")
1590+
public Response registerDataFilesInCollection(@Context ContainerRequestContext crc, @PathParam("alias") String alias, @QueryParam("sleep") Integer sleepInterval) {
1591+
Dataverse collection;
1592+
try {
1593+
collection = findDataverseOrDie(alias);
1594+
} catch (WrappedResponse r) {
1595+
return r.getResponse();
1596+
}
1597+
1598+
AuthenticatedUser superuser = authSvc.getAdminUser();
1599+
if (superuser == null) {
1600+
return error(Response.Status.INTERNAL_SERVER_ERROR, "Cannot find the superuser to execute /admin/registerDataFiles.");
1601+
}
1602+
1603+
if (!systemConfig.isFilePIDsEnabledForCollection(collection)) {
1604+
return ok("Registration of file-level pid is disabled in collection "+alias+"; nothing to do");
1605+
}
1606+
1607+
List<DataFile> dataFiles = fileService.findByDirectCollectionOwner(collection.getId());
1608+
Integer count = dataFiles.size();
1609+
Integer countSuccesses = 0;
1610+
Integer countAlreadyRegistered = 0;
1611+
Integer countReleased = 0;
1612+
Integer countDrafts = 0;
1613+
1614+
if (sleepInterval == null) {
1615+
sleepInterval = 1;
1616+
} else if (sleepInterval.intValue() < 1) {
1617+
return error(Response.Status.BAD_REQUEST, "Invalid sleep interval: "+sleepInterval);
1618+
}
1619+
1620+
logger.info("Starting to register: analyzing " + count + " files. " + new Date());
1621+
logger.info("Only unregistered, published files will be registered.");
1622+
1623+
1624+
1625+
for (DataFile df : dataFiles) {
1626+
try {
1627+
if ((df.getIdentifier() == null || df.getIdentifier().isEmpty())) {
1628+
if (df.isReleased()) {
1629+
countReleased++;
1630+
DataverseRequest r = createDataverseRequest(superuser);
1631+
execCommand(new RegisterDvObjectCommand(r, df));
1632+
countSuccesses++;
1633+
if (countSuccesses % 100 == 0) {
1634+
logger.info(countSuccesses + " out of " + count + " files registered successfully. " + new Date());
1635+
}
1636+
} else {
1637+
countDrafts++;
1638+
logger.fine(countDrafts + " out of " + count + " files not yet published");
1639+
}
1640+
} else {
1641+
countAlreadyRegistered++;
1642+
logger.fine(countAlreadyRegistered + " out of " + count + " files are already registered. " + new Date());
1643+
}
1644+
} catch (WrappedResponse ex) {
1645+
countReleased++;
1646+
logger.info("Failed to register file id: " + df.getId());
1647+
Logger.getLogger(Datasets.class.getName()).log(Level.SEVERE, null, ex);
1648+
} catch (Exception e) {
1649+
logger.info("Unexpected Exception: " + e.getMessage());
1650+
}
1651+
1652+
try {
1653+
Thread.sleep(sleepInterval * 1000);
1654+
} catch (InterruptedException ie) {
1655+
logger.warning("Interrupted Exception when attempting to execute Thread.sleep()!");
1656+
}
1657+
}
1658+
1659+
logger.info(countAlreadyRegistered + " out of " + count + " files were already registered. " + new Date());
1660+
logger.info(countDrafts + " out of " + count + " files are not yet published. " + new Date());
1661+
logger.info(countReleased + " out of " + count + " unregistered, published files to register. " + new Date());
1662+
logger.info(countSuccesses + " out of " + countReleased + " unregistered, published files registered successfully. "
1663+
+ new Date());
1664+
1665+
return ok("Datafile registration complete. " + countSuccesses + " out of " + countReleased
1666+
+ " unregistered, published files registered successfully.");
1667+
}
15801668

15811669
@GET
15821670
@AuthRequired
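
The new endpoint treats a missing ``sleep`` parameter as 1 second and rejects values below 1, then multiplies the interval by 1000 for ``Thread.sleep``. The sketch below isolates that handling as a small helper. The class and method names are hypothetical, and converting with ``TimeUnit`` into a ``long`` is a substitution made here to avoid ``int`` overflow for very large intervals; it is not what the commit does.

```java
import java.util.concurrent.TimeUnit;

class SleepIntervalSketch {

    static final int DEFAULT_SLEEP_SECONDS = 1;

    // Mirrors the endpoint's rules: null means "use the default",
    // anything below 1 is invalid.
    static int resolveSleepSeconds(Integer requested) {
        if (requested == null) {
            return DEFAULT_SLEEP_SECONDS;
        }
        if (requested < 1) {
            throw new IllegalArgumentException("Invalid sleep interval: " + requested);
        }
        return requested;
    }

    // Converts seconds to milliseconds as a long, so large values do not overflow.
    static long toMillis(int seconds) {
        return TimeUnit.SECONDS.toMillis(seconds);
    }

    public static void main(String[] args) throws InterruptedException {
        int seconds = resolveSleepSeconds(null);
        System.out.println("Sleeping " + toMillis(seconds) + " ms between registrations");
        Thread.sleep(toMillis(seconds));
    }
}
```

In the endpoint itself an invalid value results in a ``BAD_REQUEST`` response rather than an exception; the exception here only stands in for that rejection.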

src/main/java/edu/harvard/iq/dataverse/api/Dataverses.java

Lines changed: 66 additions & 1 deletion
@@ -82,6 +82,7 @@
8282

8383
import edu.harvard.iq.dataverse.util.json.JSONLDUtil;
8484
import edu.harvard.iq.dataverse.util.json.JsonParseException;
85+
import edu.harvard.iq.dataverse.util.json.JsonPrinter;
8586
import static edu.harvard.iq.dataverse.util.json.JsonPrinter.brief;
8687
import java.io.StringReader;
8788
import java.util.Collections;
@@ -129,6 +130,7 @@
129130
import java.util.Optional;
130131
import java.util.stream.Collectors;
131132
import javax.servlet.http.HttpServletResponse;
133+
import javax.validation.constraints.NotNull;
132134
import javax.ws.rs.WebApplicationException;
133135
import javax.ws.rs.core.Context;
134136
import javax.ws.rs.core.StreamingOutput;
@@ -166,7 +168,7 @@ public class Dataverses extends AbstractApiBean {
166168

167169
@EJB
168170
SwordServiceBean swordService;
169-
171+
170172
@POST
171173
@AuthRequired
172174
public Response addRoot(@Context ContainerRequestContext crc, String body) {
@@ -590,6 +592,69 @@ public Response deleteDataverse(@Context ContainerRequestContext crc, @PathParam
590592
}, getRequestUser(crc));
591593
}
592594

595+
/**
596+
* Endpoint to change attributes of a Dataverse collection.
597+
*
598+
* @apiNote Example curl command:
599+
* <code>curl -X PUT "http://localhost:8080/api/dataverses/$ALIAS/attribute/alias?value=test"</code>
600+
* to change the alias of the collection named $ALIAS to "test".
601+
*/
602+
@PUT
603+
@AuthRequired
604+
@Path("{identifier}/attribute/{attribute}")
605+
public Response updateAttribute(@Context ContainerRequestContext crc, @PathParam("identifier") String identifier,
606+
@PathParam("attribute") String attribute, @QueryParam("value") String value) {
607+
try {
608+
Dataverse collection = findDataverseOrDie(identifier);
609+
User user = getRequestUser(crc);
610+
DataverseRequest dvRequest = createDataverseRequest(user);
611+
612+
// TODO: The cases below use hard coded strings, because we have no place for definitions of those!
613+
// They are taken from util.json.JsonParser / util.json.JsonPrinter. This shall be changed.
614+
// This also should be extended to more attributes, like the type, theme, contacts, some booleans, etc.
615+
switch (attribute) {
616+
case "alias":
617+
collection.setAlias(value);
618+
break;
619+
case "name":
620+
collection.setName(value);
621+
break;
622+
case "description":
623+
collection.setDescription(value);
624+
break;
625+
case "affiliation":
626+
collection.setAffiliation(value);
627+
break;
628+
/* commenting out the code from the draft pr #9462:
629+
case "versionPidsConduct":
630+
CollectionConduct conduct = CollectionConduct.findBy(value);
631+
if (conduct == null) {
632+
return badRequest("'" + value + "' is not one of [" +
633+
String.join(",", CollectionConduct.asList()) + "]");
634+
}
635+
collection.setDatasetVersionPidConduct(conduct);
636+
break;
637+
*/
638+
case "filePIDsEnabled":
639+
collection.setFilePIDsEnabled(parseBooleanOrDie(value));
640+
break;
641+
default:
642+
return badRequest("'" + attribute + "' is not a supported attribute");
643+
}
644+
645+
// Off to persistence layer
646+
execCommand(new UpdateDataverseCommand(collection, null, null, dvRequest, null));
647+
648+
// Also return modified collection to user
649+
return ok("Update successful", JsonPrinter.json(collection));
650+
651+
// TODO: This is an anti-pattern, necessary due to this bean being an EJB, causing very noisy and unnecessary
652+
// logging by the EJB container for bubbling exceptions. (It would be handled by the error handlers.)
653+
} catch (WrappedResponse e) {
654+
return e.getResponse();
655+
}
656+
}
657+
593658
@DELETE
594659
@AuthRequired
595660
@Path("{linkingDataverseId}/deleteLink/{linkedDataverseId}")
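
The ``updateAttribute`` endpoint above reads the new value from the ``value`` query parameter of a ``PUT`` request. As a rough client-side illustration, the sketch below builds such a request with the JDK's ``java.net.http`` API without sending it. The server URL, collection identifier, and API token are placeholder assumptions, and the class name is hypothetical.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

class CollectionAttributeRequestSketch {

    // Builds (but does not send) a PUT request for
    // /api/dataverses/{id}/attribute/{attribute}?value={value}.
    static HttpRequest build(String serverUrl, String collectionId,
                             String attribute, String value, String apiToken) {
        String url = serverUrl + "/api/dataverses/" + collectionId
                + "/attribute/" + attribute
                + "?value=" + URLEncoder.encode(value, StandardCharsets.UTF_8);
        return HttpRequest.newBuilder(URI.create(url))
                .header("X-Dataverse-key", apiToken)
                .PUT(HttpRequest.BodyPublishers.noBody())
                .build();
    }

    public static void main(String[] args) {
        // Placeholder server, collection alias, and token.
        HttpRequest request = build("https://demo.dataverse.org", "root",
                "filePIDsEnabled", "true", "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx");
        System.out.println(request.method() + " " + request.uri());
    }
}
```

Sending the request would additionally need a ``java.net.http.HttpClient``; that step is left out so the example stays runnable without a Dataverse installation.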

src/main/java/edu/harvard/iq/dataverse/datasetutility/AddReplaceFileHelper.java

Lines changed: 1 addition & 1 deletion
@@ -645,7 +645,7 @@ private boolean runAddReplacePhase1(Dataset owner,
645645
df.setRootDataFileId(fileToReplace.getRootDataFileId());
646646
}
647647
// Reuse any file PID during a replace operation (if File PIDs are in use)
648-
if (systemConfig.isFilePIDsEnabled()) {
648+
if (systemConfig.isFilePIDsEnabledForCollection(owner.getOwner())) {
649649
df.setGlobalId(fileToReplace.getGlobalId());
650650
df.setGlobalIdCreateTime(fileToReplace.getGlobalIdCreateTime());
651651
// Should be true or fileToReplace wouldn't have an identifier (since it's not

src/main/java/edu/harvard/iq/dataverse/engine/command/impl/FinalizeDatasetPublicationCommand.java

Lines changed: 1 addition & 1 deletion
@@ -366,7 +366,7 @@ private void publicizeExternalIdentifier(Dataset dataset, CommandContext ctxt) t
366366
String currentGlobalIdProtocol = ctxt.settings().getValueForKey(SettingsServiceBean.Key.Protocol, "");
367367
String currentGlobalAuthority = ctxt.settings().getValueForKey(SettingsServiceBean.Key.Authority, "");
368368
String dataFilePIDFormat = ctxt.settings().getValueForKey(SettingsServiceBean.Key.DataFilePIDFormat, "DEPENDENT");
369-
boolean isFilePIDsEnabled = ctxt.systemConfig().isFilePIDsEnabled();
369+
boolean isFilePIDsEnabled = ctxt.systemConfig().isFilePIDsEnabledForCollection(getDataset().getOwner());
370370
// We will skip trying to register the global identifiers for datafiles
371371
// if "dependent" file-level identifiers are requested, AND the naming
372372
// protocol, or the authority of the dataset global id is different from
