Merged
85 commits
1ab8b57
add debug index logging
qqmyers Mar 25, 2025
8cf78c4
use loop constants, etc.
qqmyers Mar 25, 2025
1e43490
minimize work when details false, check restrict earlier/simplier
qqmyers Mar 25, 2025
3b746f7
really fix test
qqmyers Mar 25, 2025
f23a274
simplify - fix restrict bug
qqmyers Mar 25, 2025
8f89906
release note
qqmyers Mar 26, 2025
17cd5b5
fix compile issue, additional tweaks
qqmyers Mar 26, 2025
fb36f3b
try parallel file loop
qqmyers Mar 27, 2025
a8e5476
fix NPE and final issues
qqmyers Mar 27, 2025
646bb83
try finddeep
qqmyers Mar 27, 2025
612e521
avoid double loop
qqmyers Mar 28, 2025
0d6f7be
diff by query
qqmyers Mar 28, 2025
e2d4e98
numeric params
qqmyers Mar 28, 2025
85425e2
fix merge issues, change doFullText logic
qqmyers Mar 28, 2025
985227b
formatting
qqmyers Mar 28, 2025
3d2c408
restore indexing of released files
qqmyers Mar 28, 2025
a649937
delay getting dataset until semaphore is available
qqmyers Mar 28, 2025
1b2548a
restore transaction, don't finddeep
qqmyers Mar 28, 2025
9deef72
simplify ToU logic
qqmyers Mar 28, 2025
9e5ea00
avoid keeping files in List
qqmyers Mar 28, 2025
b7924a3
change dataset case too
qqmyers Mar 28, 2025
6f6e32e
avoid variableservice
qqmyers Mar 28, 2025
dfbf603
try EAGER
qqmyers Mar 28, 2025
7e508b6
avoid isTabularData
qqmyers Mar 28, 2025
7296db3
restore indexing new files in first versions
qqmyers Mar 29, 2025
7b2c3e0
revert to loop, add try around datatable part
qqmyers Mar 30, 2025
d94bbdc
messed merge
qqmyers Mar 31, 2025
3eed1b4
Shift dataset-level constants out of loops
qqmyers Mar 30, 2025
9cc7ba4
Calculate desired cards once
qqmyers Mar 30, 2025
829826b
calc datasetVersionsToBuildCardsFor once
qqmyers Mar 30, 2025
2aa2709
Custom permission query for filedownloaders
qqmyers Mar 30, 2025
a1f624d
replace findDvObjectPerms
qqmyers Mar 30, 2025
6611cfd
typo
qqmyers Mar 30, 2025
80a3fb2
cache up front
qqmyers Mar 30, 2025
4cd01b8
avoid duplicate loop over datasetVersions
qqmyers Mar 30, 2025
c3ab56f
let exceptions bubble up
qqmyers Mar 31, 2025
15a3f95
remove deprecated always true boolean
qqmyers Mar 31, 2025
857b474
avoid duplicate doc generation
qqmyers Mar 31, 2025
54f4f41
add debug timing, remove unused code
qqmyers Mar 31, 2025
04895f8
fix logic
qqmyers Mar 31, 2025
e5e9fd4
use named query, log no cache case (should never be true now)
qqmyers Mar 31, 2025
79493df
move query
qqmyers Mar 31, 2025
b1233ab
drop String.class
qqmyers Mar 31, 2025
9550fc2
try result set mapping
qqmyers Mar 31, 2025
dc01500
numeric params
qqmyers Mar 31, 2025
e85cd22
test bypassing query
qqmyers Apr 1, 2025
0469d35
query hardcode to role 2
qqmyers Apr 1, 2025
3a6faee
typos
qqmyers Apr 1, 2025
81d54a7
use array and any
qqmyers Apr 1, 2025
eadb0b4
add per batch logging
qqmyers Apr 1, 2025
f542e55
try stream
qqmyers Apr 1, 2025
dbca955
remove eager on datatable
qqmyers Apr 1, 2025
82fe5df
revert to original query
qqmyers Apr 1, 2025
1ece7f1
one version at a time in construct docs
qqmyers Apr 1, 2025
ba13dda
avoid getting list of filemetadatas
qqmyers Apr 1, 2025
49e7d32
typo
qqmyers Apr 1, 2025
eff01f9
fix pub date source
qqmyers Apr 1, 2025
6b6fc54
try all files
qqmyers Apr 2, 2025
cf03071
try cache increases
qqmyers Apr 2, 2025
946035b
remove coord protocol
qqmyers Apr 2, 2025
afeeb39
limit at 1K
qqmyers Apr 2, 2025
b92489f
try weak on files/md
qqmyers Apr 2, 2025
0fcd064
add file proxy
qqmyers Apr 2, 2025
93c4e69
stream, cleanup feature flag
qqmyers Apr 2, 2025
7c817b9
make the jvm option optional
qqmyers Apr 2, 2025
2a6e9f3
merge fix
qqmyers Apr 2, 2025
2f87415
DvObj missed changes
qqmyers Apr 2, 2025
1bfe78d
cleanup
qqmyers Apr 2, 2025
60ec76e
Merge remote-tracking branch 'IQSS/develop' into solr-index-improvements
qqmyers Apr 3, 2025
36b4efb
cleanup, remove restricted ft code from QDR
qqmyers Apr 3, 2025
c508ec6
make named queries
qqmyers Apr 3, 2025
474c3b2
try stream, remove sync blocks from parallel test
qqmyers Apr 3, 2025
82043d1
docs and setting updates
qqmyers Apr 3, 2025
da1b631
sync query mapping and constructor
qqmyers Apr 3, 2025
2db1625
named query, back to asc order
qqmyers Apr 3, 2025
db8791e
query fix
qqmyers Apr 3, 2025
7cf09a6
lengthen hard commit time
qqmyers Apr 3, 2025
750974f
remove unused query
qqmyers Apr 3, 2025
2dc56e1
revert hard commit change
qqmyers Apr 3, 2025
862197d
remove shared cache from persistence.xml
qqmyers Apr 4, 2025
ac32815
Revert "revert hard commit change"
qqmyers Apr 4, 2025
cff9848
update query to recurse to permissionroot
qqmyers Apr 4, 2025
e4e39d4
fix mapping to long
qqmyers Apr 4, 2025
9c36fcf
flip recursion
qqmyers Apr 5, 2025
359c153
Merge remote-tracking branch 'IQSS/develop' into solr-index-improvements
qqmyers May 19, 2025
2 changes: 1 addition & 1 deletion conf/solr/solrconfig.xml
@@ -238,7 +238,7 @@
have some sort of hard autoCommit to limit the log size.
-->
<autoCommit>
-<maxTime>${solr.autoCommit.maxTime:30000}</maxTime>
+<maxTime>${solr.autoCommit.maxTime:300000}</maxTime>
<openSearcher>false</openSearcher>
</autoCommit>

5 changes: 5 additions & 0 deletions doc/release-notes/11374-indexing-improvement.md
@@ -0,0 +1,5 @@
### Solr Indexing speed improved

The performance of Solr indexing has been significantly improved, particularly for datasets with many files.

A new dataverse.solr.min-files-to-use-proxy MicroProfile setting can be used to further improve performance and lower memory requirements for datasets with many files (e.g. 500+). It defaults to Integer.MAX_VALUE, which disables the new functionality.
11 changes: 11 additions & 0 deletions doc/sphinx-guides/source/installation/config.rst
@@ -2689,6 +2689,17 @@ when using it to configure your core name!

Can also be set via *MicroProfile Config API* sources, e.g. the environment variable ``DATAVERSE_SOLR_PATH``.

dataverse.solr.min-files-to-use-proxy
+++++++++++++++++++++++++++++++++++++

Specifies when to use a smaller datafile proxy object for the purposes of dataset indexing. This can lower memory requirements
and improve performance when reindexing large datasets (e.g. those with hundreds or thousands of files). (Creating the proxy may slightly slow indexing datasets with only a few files.)

This setting is the minimum number of files a dataset must have for the datafile proxy to be used. By default, this is set to Integer.MAX_VALUE, which disables using the proxy.
A recommended value would be ~1000 but the optimal value may vary depending on details of your installation.

Can also be set via *MicroProfile Config API* sources, e.g. the environment variable ``DATAVERSE_SOLR_MIN_FILES_TO_USE_PROXY``.
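
As a rough illustration of the threshold semantics described above, the sketch below decides whether a dataset of a given size would be indexed through the proxy. The class and method names are hypothetical, the `>=` comparison is an assumption drawn from the setting name ("min files"), and the real setting is resolved through MicroProfile Config rather than by reading the environment variable directly.

```java
// Hypothetical helper for illustration; not part of the Dataverse codebase.
class ProxyThresholdSketch {

    // Documented default: Integer.MAX_VALUE, which effectively disables the proxy.
    static final int DEFAULT_MIN_FILES = Integer.MAX_VALUE;

    // Parses a raw setting value, falling back to the default when it is
    // missing or not a valid integer.
    static int parseMinFiles(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return DEFAULT_MIN_FILES;
        }
        try {
            return Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            return DEFAULT_MIN_FILES;
        }
    }

    // Assumption: the proxy is used once a dataset has at least
    // minFilesToUseProxy files.
    static boolean useProxy(int fileCount, int minFilesToUseProxy) {
        return fileCount >= minFilesToUseProxy;
    }

    public static void main(String[] args) {
        int minFiles = parseMinFiles(System.getenv("DATAVERSE_SOLR_MIN_FILES_TO_USE_PROXY"));
        System.out.println("threshold = " + minFiles);
        System.out.println("50 files   -> proxy? " + useProxy(50, minFiles));
        System.out.println("5000 files -> proxy? " + useProxy(5000, minFiles));
    }
}
```

With the variable unset, both lines report that the proxy is not used, which matches the documented default of leaving the feature off.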

dataverse.solr.concurrency.max-async-indexes
++++++++++++++++++++++++++++++++++++++++++++

22 changes: 22 additions & 0 deletions src/main/java/edu/harvard/iq/dataverse/DataFile.java
@@ -13,6 +13,7 @@
import edu.harvard.iq.dataverse.datasetutility.FileSizeChecker;
import edu.harvard.iq.dataverse.ingest.IngestReport;
import edu.harvard.iq.dataverse.ingest.IngestRequest;
import edu.harvard.iq.dataverse.search.SolrIndexServiceBean;
import edu.harvard.iq.dataverse.util.BundleUtil;
import edu.harvard.iq.dataverse.util.FileUtil;
import edu.harvard.iq.dataverse.util.ShapefileHandler;
@@ -23,6 +24,7 @@
import java.util.Objects;
import java.text.SimpleDateFormat;
import java.util.Arrays;
import java.util.Date;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
@@ -50,6 +52,26 @@
@NamedQuery(name="DataFile.findDataFileThatReplacedId",
query="SELECT s.id FROM DataFile s WHERE s.previousDataFileId=:identifier")
})
@NamedNativeQuery(
name = "DataFile.getDataFileInfoForPermissionIndexing",
query = "SELECT fm.label, df.id, dvo.publicationDate " +
"FROM filemetadata fm " +
"JOIN datafile df ON fm.datafile_id = df.id " +
"JOIN dvobject dvo ON df.id = dvo.id " +
"WHERE fm.datasetversion_id = ?",
resultSetMapping = "DataFileInfoMapping"
)
@SqlResultSetMapping(
name = "DataFileInfoMapping",
classes = @ConstructorResult(
targetClass = SolrIndexServiceBean.DataFileProxy.class,
columns = {
@ColumnResult(name = "label", type = String.class),
@ColumnResult(name = "id", type = Long.class),
@ColumnResult(name = "publicationDate", type = Date.class)
}
)
)
@Entity
@Table(indexes = {@Index(columnList="ingeststatus")
, @Index(columnList="checksumvalue")
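
A `@ConstructorResult` mapping like the one above needs the target class to expose a constructor whose parameter order and types line up with the `@ColumnResult` list (label, id, publicationDate). The real `SolrIndexServiceBean.DataFileProxy` is not shown in this part of the diff, so the following is only a sketch of what such a lightweight class might look like; the class name, accessors, and the `isReleased()` helper are assumptions.

```java
import java.util.Date;

// Hypothetical stand-in for SolrIndexServiceBean.DataFileProxy.
// It carries only the three columns selected by the native query.
class DataFileProxySketch {
    private final String label;
    private final Long id;
    private final Date publicationDate;

    // Parameter order matches the @ColumnResult order in the mapping:
    // label (String), id (Long), publicationDate (Date).
    public DataFileProxySketch(String label, Long id, Date publicationDate) {
        this.label = label;
        this.id = id;
        this.publicationDate = publicationDate;
    }

    public String getLabel() {
        return label;
    }

    public Long getId() {
        return id;
    }

    public Date getPublicationDate() {
        return publicationDate;
    }

    // Assumption: a file counts as released once it has a publication date.
    public boolean isReleased() {
        return publicationDate != null;
    }

    public static void main(String[] args) {
        DataFileProxySketch draft = new DataFileProxySketch("data.csv", 42L, null);
        DataFileProxySketch published = new DataFileProxySketch("readme.txt", 43L, new Date(0L));
        System.out.println(draft.getLabel() + " released? " + draft.isReleased());
        System.out.println(published.getLabel() + " released? " + published.isReleased());
    }
}
```

The point of such a class is that it avoids loading full `DataFile` entities when only a few columns are needed for permission indexing.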
20 changes: 20 additions & 0 deletions src/main/java/edu/harvard/iq/dataverse/Dataset.java
@@ -20,17 +20,20 @@
import java.util.Objects;
import java.util.Set;
import jakarta.persistence.CascadeType;
import jakarta.persistence.ColumnResult;
import jakarta.persistence.Entity;
import jakarta.persistence.Index;
import jakarta.persistence.JoinColumn;
import jakarta.persistence.ManyToOne;
import jakarta.persistence.NamedNativeQuery;
import jakarta.persistence.NamedQueries;
import jakarta.persistence.NamedQuery;
import jakarta.persistence.NamedStoredProcedureQuery;
import jakarta.persistence.OneToMany;
import jakarta.persistence.OneToOne;
import jakarta.persistence.OrderBy;
import jakarta.persistence.ParameterMode;
import jakarta.persistence.SqlResultSetMapping;
import jakarta.persistence.StoredProcedureParameter;
import jakarta.persistence.Table;
import jakarta.persistence.Temporal;
@@ -71,6 +74,23 @@
@NamedQuery(name = "Dataset.countAll",
query = "SELECT COUNT(ds) FROM Dataset ds")
})
@NamedNativeQuery(
name = "Dataset.findAllOrSubsetOrderByFilesOwned",
query = "SELECT DISTINCT CAST(o.id AS BIGINT) as id, COUNT(f.id) as numFiles " +
"FROM dvobject o " +
"LEFT JOIN dvobject f ON f.owner_id = o.id " +
"WHERE o.dtype = 'Dataset' " +
"AND (? = false OR o.indexTime IS NULL) " +
"GROUP BY o.id " +
"ORDER BY numfiles ASC, id",
resultSetMapping = "DatasetIdMapping"
)
@SqlResultSetMapping(
name = "DatasetIdMapping",
columns = {
@ColumnResult(name = "id", type = Long.class)
}
)

/*
Below is the database stored procedure for getting a string dataset id.
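
The `Dataset.findAllOrSubsetOrderByFilesOwned` query above returns dataset ids ordered by the number of owned objects (ascending), with the id as a tie-breaker. An in-memory analogue of that ordering, using a hypothetical `Row` type to stand in for one grouped result row, might look like this:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Illustrative only: mirrors "ORDER BY numfiles ASC, id" in plain Java.
class DatasetOrderSketch {

    // Hypothetical row: a dataset id and the count of dvobjects it owns.
    static final class Row {
        final long id;
        final long numFiles;

        Row(long id, long numFiles) {
            this.id = id;
            this.numFiles = numFiles;
        }
    }

    // Sorts by file count first, then by id, and returns only the ids,
    // just as the result set mapping keeps only the "id" column.
    static List<Long> orderByFilesOwned(List<Row> rows) {
        List<Row> sorted = new ArrayList<>(rows);
        sorted.sort(Comparator.<Row>comparingLong(r -> r.numFiles)
                .thenComparingLong(r -> r.id));
        List<Long> ids = new ArrayList<>();
        for (Row r : sorted) {
            ids.add(r.id);
        }
        return ids;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
                new Row(3, 10), new Row(1, 0), new Row(2, 10), new Row(5, 2));
        System.out.println(orderByFilesOwned(rows));
    }
}
```

Ordering small datasets first means a full reindex makes visible progress quickly before reaching the datasets with many files.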
29 changes: 3 additions & 26 deletions src/main/java/edu/harvard/iq/dataverse/DatasetServiceBean.java
@@ -279,32 +279,9 @@ public List<Long> findAllOrSubsetOrderByFilesOwned(boolean skipIndexed) {
SEK - 11/09/2021
*/

-String skipClause = skipIndexed ? "AND o.indexTime is null " : "";
-Query query = em.createNativeQuery(" Select distinct(o.id), count(f.id) as numFiles FROM dvobject o " +
-"left join dvobject f on f.owner_id = o.id where o.dtype = 'Dataset' "
-+ skipClause
-+ " group by o.id "
-+ "ORDER BY count(f.id) asc, o.id");
-
-List<Object[]> queryResults;
-queryResults = query.getResultList();
-
-List<Long> retVal = new ArrayList();
-for (Object[] result : queryResults) {
-Long dsId;
-if (result[0] != null) {
-try {
-dsId = Long.parseLong(result[0].toString()) ;
-} catch (Exception ex) {
-dsId = null;
-}
-if (dsId == null) {
-continue;
-}
-retVal.add(dsId);
-}
-}
-return retVal;
+return em.createNamedQuery("Dataset.findAllOrSubsetOrderByFilesOwned", Long.class)
+.setParameter(1, skipIndexed)
+.getResultList();
}

/**
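
The removed code appended `AND o.indexTime is null` only when `skipIndexed` was true, while the named query always includes `(? = false OR o.indexTime IS NULL)`. The two filters should keep the same rows; the sketch below checks that row-level equivalence in plain Java. The class and method names are hypothetical, and a nullable `Long` stands in for the `indexTime` column.

```java
// Illustrative check that the old conditional clause and the new
// parameterized predicate select the same rows.
class SkipIndexedPredicateSketch {

    // Old behaviour: the "indexTime is null" filter applies only when
    // skipIndexed is true; otherwise every row is kept.
    static boolean oldKeeps(boolean skipIndexed, Long indexTime) {
        if (skipIndexed) {
            return indexTime == null;
        }
        return true;
    }

    // New behaviour: (? = false OR o.indexTime IS NULL)
    static boolean newKeeps(boolean skipIndexed, Long indexTime) {
        return !skipIndexed || indexTime == null;
    }

    public static void main(String[] args) {
        boolean[] flags = {false, true};
        Long[] times = {null, 1743120000000L};
        for (boolean skip : flags) {
            for (Long t : times) {
                System.out.println("skipIndexed=" + skip + ", indexTime=" + t
                        + " -> old=" + oldKeeps(skip, t) + ", new=" + newKeeps(skip, t));
            }
        }
    }
}
```

Expressing the condition through a bound parameter is what allows the SQL text to be a constant and therefore declared once as a named native query.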
36 changes: 36 additions & 0 deletions src/main/java/edu/harvard/iq/dataverse/FileMetadata.java
@@ -25,6 +25,7 @@
import jakarta.json.Json;
import jakarta.json.JsonArrayBuilder;
import jakarta.persistence.Column;
import jakarta.persistence.ColumnResult;
import jakarta.persistence.Entity;
import jakarta.persistence.GeneratedValue;
import jakarta.persistence.GenerationType;
@@ -35,8 +36,10 @@
import jakarta.persistence.JoinTable;
import jakarta.persistence.ManyToMany;
import jakarta.persistence.ManyToOne;
import jakarta.persistence.NamedNativeQuery;
import jakarta.persistence.OneToMany;
import jakarta.persistence.OrderBy;
import jakarta.persistence.SqlResultSetMapping;
import jakarta.persistence.Table;
import jakarta.persistence.Transient;
import jakarta.persistence.Version;
@@ -62,6 +65,39 @@
* @author skraffmiller
*/
@Table(indexes = {@Index(columnList="datafile_id"), @Index(columnList="datasetversion_id")} )
@NamedNativeQuery(
name = "FileMetadata.compareFileMetadata",
query = "WITH fm_categories AS (" +
" SELECT fmd.filemetadatas_id, " +
" STRING_AGG(dfc.name, ',' ORDER BY dfc.name) AS categories " +
" FROM FileMetadata_DataFileCategory fmd " +
" JOIN DataFileCategory dfc ON fmd.filecategories_id = dfc.id " +
" GROUP BY fmd.filemetadatas_id " +
") " +
"SELECT fm1.id " +
"FROM FileMetadata fm1 " +
"LEFT JOIN FileMetadata fm2 ON fm1.datafile_id = fm2.datafile_id " +
" AND fm2.datasetversion_id = ?1 " +
"LEFT JOIN fm_categories fc1 ON fc1.filemetadatas_id = fm1.id " +
"LEFT JOIN fm_categories fc2 ON fc2.filemetadatas_id = fm2.id " +
"WHERE fm1.datasetversion_id = ?2 " +
" AND (fm2.id IS NULL " +
" OR (fm1.datafile_id = fm2.datafile_id " +
" AND (fm2.description IS DISTINCT FROM fm1.description " +
" OR fm2.directoryLabel IS DISTINCT FROM fm1.directoryLabel " +
" OR fm2.label != fm1.label " +
" OR fm2.restricted IS DISTINCT FROM fm1.restricted " +
" OR fm2.prov_freeform IS DISTINCT FROM fm1.prov_freeform " +
" OR fc1.categories IS DISTINCT FROM fc2.categories " +
" ) " +
" ) " +
" )",
resultSetMapping = "IdToLongMapping"
)
@SqlResultSetMapping(
name = "IdToLongMapping",
columns = @ColumnResult(name = "id", type = Long.class)
)
@Entity
public class FileMetadata implements Serializable {
private static final long serialVersionUID = 1L;
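
The `FileMetadata.compareFileMetadata` query above returns the ids of file metadata rows in one version (`?2`) that either have no counterpart for the same datafile in the other version (`?1`) or differ in description, directory label, label, restriction, provenance text, or aggregated categories. A plain-Java analogue is sketched below. The `Fm` type and method names are hypothetical, `Objects.equals` stands in for `IS DISTINCT FROM`, labels are assumed non-null, and Java's natural string order stands in for the database collation used by `STRING_AGG ... ORDER BY`.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Illustrative in-memory analogue of the compareFileMetadata query.
class CompareFileMetadataSketch {

    // Hypothetical, trimmed-down view of one filemetadata row.
    static final class Fm {
        final long id;
        final long dataFileId;
        final String label;
        final String description;
        final String directoryLabel;
        final String provFreeform;
        final boolean restricted;
        final List<String> categories;

        Fm(long id, long dataFileId, String label, String description,
           String directoryLabel, String provFreeform, boolean restricted,
           List<String> categories) {
            this.id = id;
            this.dataFileId = dataFileId;
            this.label = label;
            this.description = description;
            this.directoryLabel = directoryLabel;
            this.provFreeform = provFreeform;
            this.restricted = restricted;
            this.categories = categories;
        }
    }

    // Mirrors STRING_AGG(name, ',' ORDER BY name); null when a row has no
    // categories, like a LEFT JOIN miss on the fm_categories CTE.
    static String aggregate(List<String> categories) {
        if (categories == null || categories.isEmpty()) {
            return null;
        }
        List<String> sorted = new ArrayList<>(categories);
        Collections.sort(sorted);
        return String.join(",", sorted);
    }

    // Null-safe comparison of every field the query checks.
    static boolean differs(Fm a, Fm b) {
        return !Objects.equals(a.description, b.description)
                || !Objects.equals(a.directoryLabel, b.directoryLabel)
                || !a.label.equals(b.label)
                || a.restricted != b.restricted
                || !Objects.equals(a.provFreeform, b.provFreeform)
                || !Objects.equals(aggregate(a.categories), aggregate(b.categories));
    }

    // Ids from thisVersion (?2) that are new or changed relative to
    // otherVersion (?1), matched on datafile id.
    static List<Long> changedIds(List<Fm> otherVersion, List<Fm> thisVersion) {
        Map<Long, Fm> byFile = new HashMap<>();
        for (Fm fm : otherVersion) {
            byFile.put(fm.dataFileId, fm);
        }
        List<Long> ids = new ArrayList<>();
        for (Fm fm1 : thisVersion) {
            Fm fm2 = byFile.get(fm1.dataFileId);
            if (fm2 == null || differs(fm1, fm2)) {
                ids.add(fm1.id);
            }
        }
        return ids;
    }
}
```

As in the SQL, a file that exists only in the other version (that is, one removed from this version) is not reported, because the comparison is driven from the rows of the `?2` version.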
68 changes: 35 additions & 33 deletions src/main/java/edu/harvard/iq/dataverse/FileVersionDifference.java
@@ -64,26 +64,41 @@ When there are changes (after v4.19)to the file metadata data model this method

if (newFileMetadata.getDataFile() == null && originalFileMetadata != null){
//File Deleted
updateDifferenceSummary("", BundleUtil.getStringFromBundle("file.versionDifferences.fileGroupTitle"), 0, 0, 1, 0);
if (details) {
updateDifferenceSummary("", BundleUtil.getStringFromBundle("file.versionDifferences.fileGroupTitle"), 0, 0, 1, 0);
}
return false;
}
if (this.originalFileMetadata == null && this.newFileMetadata.getDataFile() != null ){

if (this.originalFileMetadata == null && this.newFileMetadata.getDataFile() != null){
//File Added
if (!details) return false;
retVal = false;
updateDifferenceSummary( "", BundleUtil.getStringFromBundle("file.versionDifferences.fileGroupTitle"), 1, 0, 0, 0);
}

//Check to see if File replaced
if (originalFileMetadata != null &&
newFileMetadata.getDataFile() != null && originalFileMetadata.getDataFile() != null &&!this.originalFileMetadata.getDataFile().equals(this.newFileMetadata.getDataFile())){
if (!details) return false;
updateDifferenceSummary( "", BundleUtil.getStringFromBundle("file.versionDifferences.fileGroupTitle"), 0, 0, 0, 1);
if (!details) {
return false;
}
retVal = false;
updateDifferenceSummary("", BundleUtil.getStringFromBundle("file.versionDifferences.fileGroupTitle"), 1, 0, 0, 0);
}

if ( originalFileMetadata != null) {
if (originalFileMetadata != null) {
// Check to see if File replaced
if (newFileMetadata.getDataFile() != null && originalFileMetadata.getDataFile() != null && !this.originalFileMetadata.getDataFile().equals(this.newFileMetadata.getDataFile())) {
if (!details)
return false;
updateDifferenceSummary("", BundleUtil.getStringFromBundle("file.versionDifferences.fileGroupTitle"), 0, 0, 0, 1);
retVal = false;
}

/*
* Get Restriction Differences
*/
if (originalFileMetadata.isRestricted() != newFileMetadata.isRestricted()) {
if (details) {
String value2 = newFileMetadata.isRestricted() ? BundleUtil.getStringFromBundle("file.versionDifferences.fileRestricted") : BundleUtil.getStringFromBundle("file.versionDifferences.fileUnrestricted");
updateDifferenceSummary(BundleUtil.getStringFromBundle("file.versionDifferences.fileAccessTitle"), value2, 0, 0, 0, 0);
}
retVal = false;
}

if (!newFileMetadata.getLabel().equals(originalFileMetadata.getLabel())) {
if (details) {
differenceDetailItems.add(new FileDifferenceDetailItem(BundleUtil.getStringFromBundle("file.versionDifferences.fileNameDetailTitle"), originalFileMetadata.getLabel(), newFileMetadata.getLabel()));
@@ -94,10 +109,8 @@ When there are changes (after v4.19)to the file metadata data model this method
BundleUtil.getStringFromBundle("file.versionDifferences.fileNameDetailTitle"), 0, 1, 0, 0);
retVal = false;
}
}

//Description differences
if ( originalFileMetadata != null) {
//Description differences
if (newFileMetadata.getDescription() != null
&& originalFileMetadata.getDescription() != null
&& !newFileMetadata.getDescription().equals(originalFileMetadata.getDescription())) {
@@ -134,9 +147,7 @@ When there are changes (after v4.19)to the file metadata data model this method
BundleUtil.getStringFromBundle("file.versionDifferences.descriptionDetailTitle"), 0, 0, 1, 0);
retVal = false;
}
}
//Provenance Description differences
if ( originalFileMetadata != null) {
//Provenance Description differences
if ((newFileMetadata.getProvFreeForm() != null && !newFileMetadata.getProvFreeForm().isEmpty())
&& (originalFileMetadata.getProvFreeForm() != null && !originalFileMetadata.getProvFreeForm().isEmpty())
&& !newFileMetadata.getProvFreeForm().equals(originalFileMetadata.getProvFreeForm())) {
@@ -173,8 +184,6 @@ When there are changes (after v4.19)to the file metadata data model this method
BundleUtil.getStringFromBundle("file.versionDifferences.provenanceDetailTitle"), 0, 0, 1, 0);
retVal = false;
}
}
if (originalFileMetadata != null) {
/*
get Tags differences
*/
@@ -188,7 +197,9 @@ When there are changes (after v4.19)to the file metadata data model this method
}

if (!value1.equals(value2)) {
if (!details) return false;
if (!details) {
return false;
}
int added = 0;
int deleted = 0;

@@ -223,16 +234,7 @@ When there are changes (after v4.19)to the file metadata data model this method
}
retVal = false;
}

/*
Get Restriction Differences
*/
value1 = originalFileMetadata.isRestricted() ? BundleUtil.getStringFromBundle("file.versionDifferences.fileRestricted") : BundleUtil.getStringFromBundle("file.versionDifferences.fileUnrestricted");
value2 = newFileMetadata.isRestricted() ? BundleUtil.getStringFromBundle("file.versionDifferences.fileRestricted") : BundleUtil.getStringFromBundle("file.versionDifferences.fileUnrestricted");
if (!value1.equals(value2)) {
updateDifferenceSummary(BundleUtil.getStringFromBundle("file.versionDifferences.fileAccessTitle"), value2, 0, 0, 0, 0);
retVal = false;
}

}
return retVal;
}
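
The changes to this method revolve around the `details` flag: when the caller only needs to know whether two file metadata objects are the same, the comparison can stop at the first difference and skip building summary entries. The following is a much-simplified sketch of that pattern, not the actual method; the `Meta` type, the two compared fields, and the summary strings are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified illustration of "return early unless details are requested".
class DetailsFlagSketch {

    // Hypothetical, minimal stand-in for a file metadata object.
    static final class Meta {
        final String label;
        final boolean restricted;

        Meta(String label, boolean restricted) {
            this.label = label;
            this.restricted = restricted;
        }
    }

    // Returns true when the two objects match. With details == false the
    // method returns at the first difference and never touches summary;
    // with details == true it records every difference before returning.
    static boolean same(Meta original, Meta updated, boolean details, List<String> summary) {
        boolean retVal = true;
        if (original.restricted != updated.restricted) {
            if (!details) {
                return false;
            }
            summary.add("access: " + (updated.restricted ? "Restricted" : "Unrestricted"));
            retVal = false;
        }
        if (!updated.label.equals(original.label)) {
            if (!details) {
                return false;
            }
            summary.add("name: " + original.label + " -> " + updated.label);
            retVal = false;
        }
        return retVal;
    }

    public static void main(String[] args) {
        Meta v1 = new Meta("data.csv", false);
        Meta v2 = new Meta("data-v2.csv", true);
        List<String> summary = new ArrayList<>();
        System.out.println("same (no details): " + same(v1, v2, false, summary));
        System.out.println("same (details):    " + same(v1, v2, true, summary));
        System.out.println(summary);
    }
}
```

For indexing, where only the boolean result matters, this avoids resource-bundle lookups and list allocations for every file in a large dataset.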
@@ -2,6 +2,7 @@

import edu.harvard.iq.dataverse.authorization.AuthenticationServiceBean;
import edu.harvard.iq.dataverse.authorization.DataverseRole;
import edu.harvard.iq.dataverse.authorization.Permission;
import edu.harvard.iq.dataverse.authorization.RoleAssignee;
import edu.harvard.iq.dataverse.authorization.groups.Group;
import edu.harvard.iq.dataverse.authorization.groups.GroupServiceBean;
@@ -27,6 +28,7 @@
import jakarta.ejb.Stateless;
import jakarta.inject.Named;
import jakarta.persistence.EntityManager;
import jakarta.persistence.NamedNativeQuery;
import jakarta.persistence.PersistenceContext;
import org.apache.commons.lang3.StringUtils;

@@ -395,6 +397,15 @@ public List<RoleAssignee> filterRoleAssignees(String query, DvObject dvObject, L

return roleAssigneeList;
}


public List<String> findAssigneesWithPermissionOnDvObject(Long objectId, Permission permission) {
int bitpos = 63 - permission.ordinal();
return em.createNamedQuery("RoleAssignment.findAssigneesWithPermissionOnDvObject", String.class)
.setParameter(1, bitpos)
.setParameter(2, objectId)
.getResultList();
}

private void msg(String s) {
//System.out.println(s);
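
The `63 - permission.ordinal()` expression in `findAssigneesWithPermissionOnDvObject` converts a bit number counted from the least significant end into an index counted from the left of a 64-bit string. The named query itself is not shown in this part of the diff, so the sketch below rests on two assumptions: that a role's permissions are stored as a `long` with bit `ordinal` set for each granted permission, and that the query tests the bit with a left-indexed lookup (as PostgreSQL's `get_bit` does on a `bit(64)` value). The `Perm` enum is a hypothetical stand-in for the real `Permission` enum.

```java
// Illustrates why index (63 - ordinal) from the left addresses the same
// bit as (1L << ordinal) from the right.
class PermissionBitSketch {

    // Hypothetical stand-in; only the ordinal positions matter here.
    enum Perm { A, B, C, D }

    // Renders the value as a 64-character binary string, most significant
    // bit first, so index 0 is the leftmost bit.
    static String toBitString(long bits) {
        String s = Long.toBinaryString(bits);
        return String.format("%64s", s).replace(' ', '0');
    }

    // Conventional check: mask the bit numbered by the ordinal.
    static boolean hasByMask(long bits, Perm p) {
        return (bits & (1L << p.ordinal())) != 0;
    }

    // Left-indexed check using the same arithmetic as the service method.
    static boolean hasByLeftIndex(long bits, Perm p) {
        int bitpos = 63 - p.ordinal();
        return toBitString(bits).charAt(bitpos) == '1';
    }

    public static void main(String[] args) {
        long bits = (1L << Perm.B.ordinal()) | (1L << Perm.D.ordinal());
        for (Perm p : Perm.values()) {
            System.out.println(p + ": mask=" + hasByMask(bits, p)
                    + ", leftIndex=" + hasByLeftIndex(bits, p));
        }
    }
}
```

Both checks agree for every permission, which is the property the `bitpos` calculation relies on.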