Commit af88123

Merge branch 'develop' into 10404-fix-NPE

2 parents 79b8d89 + 3c55c3f

16 files changed: 239 additions & 100 deletions

Lines changed: 5 additions & 0 deletions

@@ -0,0 +1,5 @@
+Two experimental feature flags called "add-publicobject-solr-field" and "avoid-expensive-solr-join" have been added to change how Solr documents are indexed for public objects and how Solr queries are constructed to accommodate access to restricted content (drafts, etc.). It is hoped that they will help with performance, especially on large instances and under load.
+
+Before the search feature flag ("avoid-expensive...") can be turned on, the indexing flag must be enabled and a full reindex performed. Otherwise, publicly available objects are NOT going to be shown in search results.
+
+For details see https://dataverse-guide--10555.org.readthedocs.build/en/10555/installation/config.html#feature-flags and #10555.
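
The full reindex mentioned above can be kicked off through the admin API (``curl http://localhost:8080/api/admin/index``). As a minimal Java sketch of the same call, assuming a stock installation where the admin API is reachable on localhost:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class FullReindexExample {
        public static void main(String[] args) throws Exception {
            // Start a full (asynchronous) reindex of all holdings via the admin API.
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://localhost:8080/api/admin/index"))
                    .GET()
                    .build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode() + ": " + response.body());
        }
    }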

Lines changed: 1 addition & 0 deletions

@@ -0,0 +1 @@
+A bug that prevented the Ingest option in the File page Edit File menu from working has been fixed.

doc/sphinx-guides/source/admin/harvestclients.rst

Lines changed: 5 additions & 0 deletions

@@ -47,3 +47,8 @@ What if a Run Fails?
 Each harvesting client run logs a separate file per run to the app server's default logging directory (``/usr/local/payara6/glassfish/domains/domain1/logs/`` unless you've changed it). Look for filenames in the format ``harvest_TARGET_YYYY_MM_DD_timestamp.log`` to get a better idea of what's going wrong.
 
 Note that you'll want to run a minimum of Dataverse Software 4.6, optimally 4.18 or beyond, for the best OAI-PMH interoperability.
+
+Harvesting Non-OAI-PMH
+~~~~~~~~~~~~~~~~~~~~~~
+
+`DOI2PMH <https://github.com/IQSS/doi2pmh-server>`__ is a community-driven project intended to allow OAI-PMH harvesting from non-OAI-PMH sources.

doc/sphinx-guides/source/api/apps.rst

Lines changed: 7 additions & 0 deletions

@@ -133,6 +133,13 @@ https://github.com/libis/rdm-integration
 PHP
 ---
 
+DOI2PMH
+~~~~~~~
+
+The DOI2PMH server allows Dataverse instances to harvest DOIs through OAI-PMH from otherwise unharvestable sources.
+
+https://github.com/IQSS/doi2pmh-server
+
 OJS
 ~~~

doc/sphinx-guides/source/api/native-api.rst

Lines changed: 1 addition & 1 deletion

@@ -1179,7 +1179,7 @@ See also :ref:`batch-exports-through-the-api` and the note below:
 export PERSISTENT_IDENTIFIER=doi:10.5072/FK2/J8SJZB
 export METADATA_FORMAT=ddi
 
-curl "$SERVER_URL/api/datasets/export?exporter=$METADATA_FORMAT&persistentId=PERSISTENT_IDENTIFIER"
+curl "$SERVER_URL/api/datasets/export?exporter=$METADATA_FORMAT&persistentId=$PERSISTENT_IDENTIFIER"
 
 The fully expanded example above (without environment variables) looks like this:

doc/sphinx-guides/source/developers/deployment.rst

Lines changed: 3 additions & 9 deletions

@@ -91,17 +91,11 @@ Download `ec2-create-instance.sh`_ and put it somewhere reasonable. For the purp
 
 .. _ec2-create-instance.sh: https://raw.githubusercontent.com/GlobalDataverseCommunityConsortium/dataverse-ansible/master/ec2/ec2-create-instance.sh
 
-To run it with default values you just need the script, but you may also want a current copy of the ansible `group vars <https://raw.githubusercontent.com/GlobalDataverseCommunityConsortium/dataverse-ansible/master/defaults/main.yml>`_ file.
+To run the script, you can make it executable (``chmod 755 ec2-create-instance.sh``) or run it with bash, as in the following example, which passes ``-h`` to print the help:
 
-ec2-create-instance accepts a number of command-line switches, including:
+``bash ~/Downloads/ec2-create-instance.sh -h``
 
-* -r: GitHub Repository URL (defaults to https://github.com/IQSS/dataverse.git)
-* -b: branch to build (defaults to develop)
-* -p: pemfile directory (defaults to $HOME)
-* -g: Ansible GroupVars file (if you wish to override role defaults)
-* -h: help (displays usage for each available option)
-
-``bash ~/Downloads/ec2-create-instance.sh -b develop -r https://github.com/scholarsportal/dataverse.git -g main.yml``
+If you run the script without any arguments, it should spin up the latest version of Dataverse.
 
 You will need to wait for 15 minutes or so until the deployment is finished, longer if you've enabled sample data and/or the API test suite. Eventually, the output should tell you how to access the Dataverse installation in a web browser or via SSH. It will also provide instructions on how to delete the instance when you are finished with it. Please be aware that AWS charges per minute for a running instance. You may also delete your instance from https://console.aws.amazon.com/console/home?region=us-east-1 .

doc/sphinx-guides/source/developers/performance.rst

Lines changed: 4 additions & 0 deletions

@@ -118,6 +118,10 @@ Solr
 
 While in the past Solr performance hasn't been much of a concern, in recent years we've noticed performance problems when Harvard Dataverse is under load. Improvements were made in `PR #10050 <https://github.com/IQSS/dataverse/pull/10050>`_, for example.
 
+We are tracking performance problems in `#10469 <https://github.com/IQSS/dataverse/issues/10469>`_.
+
+In a meeting with a Solr expert on 2024-05-10 we were advised to avoid joins as much as possible. (It was acknowledged that many Solr users rely on joins because they have to, as we do, to keep some documents private.) Toward that end we have added two feature flags, ``avoid-expensive-solr-join`` and ``add-publicobject-solr-field``, as explained under :ref:`feature-flags`. It was confirmed experimentally that performing the join on all the public objects (published collections, datasets, and files), i.e., the bulk of the content in the search index, was indeed very expensive, especially on a large instance the size of the IQSS production archive and especially under indexing load. We confirmed that the join was in fact unnecessary and were able to replace it with a boolean field set directly in the indexed documents, which is what the two feature flags above achieve. However, as of this writing, this mechanism should still be considered experimental.
+
 Datasets with Large Numbers of Files or Versions
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
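
To make the query-side change concrete, here is a small SolrJ sketch. The join filter in the comment is illustrative only (the real from/to field names in Dataverse differ); the replacement filter uses the ``PublicObject_b`` boolean field described under :ref:`feature-flags`:

    import org.apache.solr.client.solrj.SolrQuery;

    public class PublicObjectFilterSketch {
        public static void main(String[] args) {
            SolrQuery query = new SolrQuery("title:climate");

            // Expensive pattern being avoided (illustrative field names):
            // a join over permission documents to find public content.
            // query.addFilterQuery("{!join from=definitionPointDocId to=id}discoverableBy:public");

            // Cheaper pattern under the feature flags: a boolean field
            // stamped on every public document at indexing time.
            query.addFilterQuery("PublicObject_b:true");

            System.out.println(query.toQueryString());
        }
    }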

doc/sphinx-guides/source/installation/config.rst

Lines changed: 6 additions & 0 deletions

@@ -3268,6 +3268,12 @@ please find all known feature flags below. Any of these flags can be activated u
    * - api-session-auth
      - Enables API authentication via session cookie (JSESSIONID). **Caution: Enabling this feature flag exposes the installation to CSRF risks!** We expect this feature flag to be temporary (only used by frontend developers, see `#9063 <https://github.com/IQSS/dataverse/issues/9063>`_) and for the feature to be removed in the future.
      - ``Off``
+   * - avoid-expensive-solr-join
+     - Changes the way Solr queries are constructed for public content (published collections, datasets, and files). It removes a very expensive Solr join on all such documents, improving overall performance, especially for large instances under heavy load. Before this feature flag is enabled, the corresponding indexing feature (see the next flag) must be turned on and a full reindex performed; otherwise, public objects are not going to be shown in search results. See :doc:`/admin/solr-search-index`.
+     - ``Off``
+   * - add-publicobject-solr-field
+     - Adds an extra boolean field ``PublicObject_b:true`` to public content (published collections, datasets, and files). Once the holdings are reindexed with this field, it can be relied on to remove a very expensive Solr join from queries (by enabling the feature flag above, ``avoid-expensive-solr-join``), significantly improving overall performance. The two flags are separate so that an instance can reindex its holdings before enabling the optimization in searches, thus avoiding having its public objects temporarily disappear from search results while the reindexing is in progress.
+     - ``Off``
 
 **Note:** Feature flags can be set via any `supported MicroProfile Config API source`_, e.g. the environment variable
 ``DATAVERSE_FEATURE_XXX`` (e.g. ``DATAVERSE_FEATURE_API_SESSION_AUTH=1``). These environment variables can be set in your shell before starting Payara. If you are using :doc:`Docker for development </container/dev-usage>`, you can set them in the `docker compose <https://docs.docker.com/compose/environment-variables/set-environment-variables/>`_ file.
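
For orientation, a flag can be read like any other MicroProfile Config property. A minimal Java sketch; the ``dataverse.feature.*`` property names below are assumptions derived from the ``DATAVERSE_FEATURE_XXX`` environment variable convention, not confirmed against the Dataverse source:

    import org.eclipse.microprofile.config.Config;
    import org.eclipse.microprofile.config.ConfigProvider;

    public class FeatureFlagCheck {
        public static void main(String[] args) {
            Config config = ConfigProvider.getConfig();

            // Assumed property names (DATAVERSE_FEATURE_AVOID_EXPENSIVE_SOLR_JOIN=1
            // would map to dataverse.feature.avoid-expensive-solr-join).
            boolean indexingFlag = config
                    .getOptionalValue("dataverse.feature.add-publicobject-solr-field", Boolean.class)
                    .orElse(false);
            boolean searchFlag = config
                    .getOptionalValue("dataverse.feature.avoid-expensive-solr-join", Boolean.class)
                    .orElse(false);

            // The documented rollout order: enable the indexing flag and run a
            // full reindex before turning on the search-side flag.
            if (searchFlag && !indexingFlag) {
                System.err.println("Warning: avoid-expensive-solr-join is on but "
                        + "add-publicobject-solr-field is off; public objects may "
                        + "be missing from search results.");
            }
        }
    }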

src/main/java/edu/harvard/iq/dataverse/DatasetServiceBean.java

Lines changed: 27 additions & 26 deletions

@@ -19,8 +19,6 @@
 import edu.harvard.iq.dataverse.export.ExportService;
 import edu.harvard.iq.dataverse.globus.GlobusServiceBean;
 import edu.harvard.iq.dataverse.harvest.server.OAIRecordServiceBean;
-import edu.harvard.iq.dataverse.pidproviders.PidProvider;
-import edu.harvard.iq.dataverse.pidproviders.PidUtil;
 import edu.harvard.iq.dataverse.search.IndexServiceBean;
 import edu.harvard.iq.dataverse.settings.SettingsServiceBean;
 import edu.harvard.iq.dataverse.util.BundleUtil;

@@ -41,11 +39,10 @@
 import jakarta.ejb.TransactionAttributeType;
 import jakarta.inject.Named;
 import jakarta.persistence.EntityManager;
-import jakarta.persistence.LockModeType;
 import jakarta.persistence.NoResultException;
+import jakarta.persistence.NonUniqueResultException;
 import jakarta.persistence.PersistenceContext;
 import jakarta.persistence.Query;
-import jakarta.persistence.StoredProcedureQuery;
 import jakarta.persistence.TypedQuery;
 import org.apache.commons.lang3.StringUtils;

@@ -115,28 +112,32 @@ public Dataset find(Object pk) {
      * @return a dataset with pre-fetched file objects
      */
     public Dataset findDeep(Object pk) {
-        return (Dataset) em.createNamedQuery("Dataset.findById")
-            .setParameter("id", pk)
-            // Optimization hints: retrieve all data in one query; this prevents point queries when iterating over the files
-            .setHint("eclipselink.left-join-fetch", "o.files.ingestRequest")
-            .setHint("eclipselink.left-join-fetch", "o.files.thumbnailForDataset")
-            .setHint("eclipselink.left-join-fetch", "o.files.dataTables")
-            .setHint("eclipselink.left-join-fetch", "o.files.auxiliaryFiles")
-            .setHint("eclipselink.left-join-fetch", "o.files.ingestReports")
-            .setHint("eclipselink.left-join-fetch", "o.files.dataFileTags")
-            .setHint("eclipselink.left-join-fetch", "o.files.fileMetadatas")
-            .setHint("eclipselink.left-join-fetch", "o.files.fileMetadatas.fileCategories")
-            .setHint("eclipselink.left-join-fetch", "o.files.fileMetadatas.varGroups")
-            //.setHint("eclipselink.left-join-fetch", "o.files.guestbookResponses
-            .setHint("eclipselink.left-join-fetch", "o.files.embargo")
-            .setHint("eclipselink.left-join-fetch", "o.files.retention")
-            .setHint("eclipselink.left-join-fetch", "o.files.fileAccessRequests")
-            .setHint("eclipselink.left-join-fetch", "o.files.owner")
-            .setHint("eclipselink.left-join-fetch", "o.files.releaseUser")
-            .setHint("eclipselink.left-join-fetch", "o.files.creator")
-            .setHint("eclipselink.left-join-fetch", "o.files.alternativePersistentIndentifiers")
-            .setHint("eclipselink.left-join-fetch", "o.files.roleAssignments")
-            .getSingleResult();
+        try {
+            return (Dataset) em.createNamedQuery("Dataset.findById")
+                .setParameter("id", pk)
+                // Optimization hints: retrieve all data in one query; this prevents point queries when iterating over the files
+                .setHint("eclipselink.left-join-fetch", "o.files.ingestRequest")
+                .setHint("eclipselink.left-join-fetch", "o.files.thumbnailForDataset")
+                .setHint("eclipselink.left-join-fetch", "o.files.dataTables")
+                .setHint("eclipselink.left-join-fetch", "o.files.auxiliaryFiles")
+                .setHint("eclipselink.left-join-fetch", "o.files.ingestReports")
+                .setHint("eclipselink.left-join-fetch", "o.files.dataFileTags")
+                .setHint("eclipselink.left-join-fetch", "o.files.fileMetadatas")
+                .setHint("eclipselink.left-join-fetch", "o.files.fileMetadatas.fileCategories")
+                .setHint("eclipselink.left-join-fetch", "o.files.fileMetadatas.varGroups")
+                //.setHint("eclipselink.left-join-fetch", "o.files.guestbookResponses
+                .setHint("eclipselink.left-join-fetch", "o.files.embargo")
+                .setHint("eclipselink.left-join-fetch", "o.files.retention")
+                .setHint("eclipselink.left-join-fetch", "o.files.fileAccessRequests")
+                .setHint("eclipselink.left-join-fetch", "o.files.owner")
+                .setHint("eclipselink.left-join-fetch", "o.files.releaseUser")
+                .setHint("eclipselink.left-join-fetch", "o.files.creator")
+                .setHint("eclipselink.left-join-fetch", "o.files.alternativePersistentIndentifiers")
+                .setHint("eclipselink.left-join-fetch", "o.files.roleAssignments")
+                .getSingleResult();
+        } catch (NoResultException | NonUniqueResultException ex) {
+            return null;
+        }
     }
 
     public List<Dataset> findByOwnerId(Long ownerId) {
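
Note the behavioral change for callers: ``findDeep`` now returns null instead of throwing when no (or more than one) dataset matches the id. A caller-side sketch; the wrapper class here is hypothetical:

    import edu.harvard.iq.dataverse.Dataset;
    import edu.harvard.iq.dataverse.DatasetServiceBean;

    // Hypothetical helper illustrating the new null contract of findDeep.
    public class FindDeepCaller {
        private final DatasetServiceBean datasetService;

        public FindDeepCaller(DatasetServiceBean datasetService) {
            this.datasetService = datasetService;
        }

        public Dataset requireDataset(Long id) {
            Dataset dataset = datasetService.findDeep(id);
            if (dataset == null) {
                // Previously this surfaced as an unhandled NoResultException;
                // callers can now handle the missing dataset explicitly.
                throw new IllegalArgumentException("No dataset found for id " + id);
            }
            return dataset;
        }
    }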

src/main/java/edu/harvard/iq/dataverse/FilePage.java

Lines changed: 10 additions & 11 deletions

@@ -522,10 +522,9 @@ public String ingestFile() throws CommandException{
             return null;
         }
 
-        DataFile dataFile = fileMetadata.getDataFile();
-        editDataset = dataFile.getOwner();
+        editDataset = file.getOwner();
 
-        if (dataFile.isTabularData()) {
+        if (file.isTabularData()) {
             JH.addMessage(FacesMessage.SEVERITY_WARN, BundleUtil.getStringFromBundle("file.ingest.alreadyIngestedWarning"));
             return null;
         }

@@ -537,25 +536,25 @@ public String ingestFile() throws CommandException{
             return null;
         }
 
-        if (!FileUtil.canIngestAsTabular(dataFile)) {
+        if (!FileUtil.canIngestAsTabular(file)) {
             JH.addMessage(FacesMessage.SEVERITY_WARN, BundleUtil.getStringFromBundle("file.ingest.cantIngestFileWarning"));
             return null;
 
         }
 
-        dataFile.SetIngestScheduled();
+        file.SetIngestScheduled();
 
-        if (dataFile.getIngestRequest() == null) {
-            dataFile.setIngestRequest(new IngestRequest(dataFile));
+        if (file.getIngestRequest() == null) {
+            file.setIngestRequest(new IngestRequest(file));
         }
 
-        dataFile.getIngestRequest().setForceTypeCheck(true);
+        file.getIngestRequest().setForceTypeCheck(true);
 
         // update the datafile, to save the newIngest request in the database:
         datafileService.save(file);
 
         // queue the data ingest job for asynchronous execution:
-        String status = ingestService.startIngestJobs(editDataset.getId(), new ArrayList<>(Arrays.asList(dataFile)), (AuthenticatedUser) session.getUser());
+        String status = ingestService.startIngestJobs(editDataset.getId(), new ArrayList<>(Arrays.asList(file)), (AuthenticatedUser) session.getUser());
 
         if (!StringUtil.isEmpty(status)) {
             // This most likely indicates some sort of a problem (for example,

@@ -565,9 +564,9 @@ public String ingestFile() throws CommandException{
             // successfully gone through the process of trying to schedule the
             // ingest job...
 
-            logger.warning("Ingest Status for file: " + dataFile.getId() + " : " + status);
+            logger.warning("Ingest Status for file: " + file.getId() + " : " + status);
         }
-        logger.fine("File: " + dataFile.getId() + " ingest queued");
+        logger.fine("File: " + file.getId() + " ingest queued");
 
         init();
         JsfHelper.addInfoMessage(BundleUtil.getStringFromBundle("file.ingest.ingestQueued"));
