
DVUploader, a Command-line Bulk Uploader for Dataverse

qqmyers edited this page Oct 29, 2018 · 23 revisions

Motivation:

Dataverse supports file uploads through its web interface. However, that interface has a limit of 1,000 files per upload session and, since it displays uploaded files in a single long list, it becomes unwieldy well before that limit is reached. The web interface's support for unzipping zip files is one way to simplify the process - files can be pre-zipped and uploaded as one larger zip file - but the interface still shows a long list of the included files.

The Dataverse community has a number of initiatives underway to support upload of larger files (greater than a few GB) and/or large numbers of files. Many of these involve configuring external storage and/or data transfer software. One, whose development was supported by TDL, is a relatively simple application (DVUploader) that can be downloaded by users. It uses the existing Dataverse application programming interface (API) to upload files from a specified directory into a specified Dataset. It can be a useful alternative to the web interface when:

  • there are hundreds or thousands of files to upload,
  • automatic verification of error-free and complete upload of files is desired,
  • new files are being generated/added to a directory and Dataverse needs to be updated with just the new files, or
  • uploading of files needs to be automated, e.g. added to an instrument or analysis script or program.

The DVUploader does need to be installed and, as a command-line tool, may not be as intuitive as the Dataverse web interface. However, unlike other bulk tools being developed, it will work with any Dataverse installation without any server-side changes. (Since it uploads and stores data via Dataverse, it shares the basic performance characteristics and limitations of Dataverse's web interface. Other tools bypass Dataverse to handle larger data, or do not move data from remote locations at all and simply reference it in a Dataverse Dataset.) The DVUploader can thus be a useful tool for individuals and for Dataverse installations interested in supporting larger numbers of files.

Installation

The DVUploader is a Java application packaged as a single jar file. For the current version (1.0.0):

Step 1: Install Java (if needed). DVUploader requires version 8 or greater; downloads for most operating systems are available from https://java.com/en/download/. Any warning about Java not being able to run in the user's browser (MS Edge is one where this warning is shown) can be ignored, as the DVUploader does not run in the browser.

Step 2: Copy the jar file (below) to a directory on your computer. The other two files below enable the DVUploader to log its use to a local file. They should be downloaded to the same directory as the jar. (If they are not downloaded, the user will see a printed warning but the DVUploader will still run.)
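As a quick check between the two steps, you can confirm from a terminal that a suitable Java is on your PATH (a minimal sketch; the exact version string varies by vendor):

```shell
# Report whether Java is available; DVUploader needs Java 8 or newer.
if command -v java >/dev/null 2>&1; then
  JAVA_STATUS="$(java -version 2>&1 | head -n 1)"
else
  JAVA_STATUS="Java not found -- install it from https://java.com/en/download/"
fi
echo "$JAVA_STATUS"
```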

Uploading Files

To prepare: Use Dataverse to:

  • find the DOI for the dataset you wish to add files to, and
  • generate an API key for yourself in the Dataverse instance you are using (from the popup menu under your profile).

The simplest way to run the DVUploader is to place the jar file and logging files downloaded above into the directory containing a subdirectory with the files intended for upload. (The DVUploader can be placed anywhere on disk and can upload files from any directory, but this requires adding these paths to the command line and/or configuration of Java's classpath.)

REQUIRED: Run the jar with the following command line:

java -jar DVUploader-1.0.0.jar -key=<api key> -did=<dataset doi> -server=<server URL> <dir or file names>

where:

<api key> is replaced with the API Key generated by the user in Dataverse

<dataset doi> is replaced with the DOI of the target Dataset

<server URL> is replaced by the URL of the Dataverse server being used (with no trailing '/' and no path to a specific Dataverse on the server), and

<dir or file names> is replaced by the name of a directory and/or a list of individual files to upload.

These four arguments are always required. There are additional options listed below. **Note:** For a first test, adding -listonly is useful: it will make the DVUploader list what it would do, but will not perform any uploads.

For example, java -jar DVUploader-1.0.0.jar -key=8599b802-659e-49ef-823c-20abd8efc05c -did=doi:10.5072/FK2/TUNNVE -server=https://dataverse.tdl.org testdir would upload all of the files in the 'testdir' directory (relative to the current directory where the java command is run) to the Dataset at https://dataverse.tdl.org/dataset.xhtml?persistentId=doi:10.5072/FK2/TUNNVE (if it existed: the dataset in this example is not real).
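For repeated runs, the command can be wrapped in a small shell script so the key, DOI, and server are defined in one place. This is a sketch with placeholder values, not real credentials; the command is echoed rather than executed, and -listonly would keep a real run as a dry run:

```shell
#!/bin/sh
# Placeholder values -- substitute your own API key, dataset DOI, and server URL.
API_KEY="00000000-0000-0000-0000-000000000000"
DATASET_DOI="doi:10.5072/FK2/TUNNVE"
SERVER_URL="https://dataverse.tdl.org"   # no trailing slash

# Assemble the DVUploader invocation; -listonly makes it a dry run.
CMD="java -jar DVUploader-1.0.0.jar -key=$API_KEY -did=$DATASET_DOI -server=$SERVER_URL -listonly testdir"
echo "$CMD"
```

Dropping -listonly from the assembled command (and running it instead of echoing it) performs the actual upload.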

The output from the DVUploader looks like:

Dataverse Mode: Uploading files to a Dataverse instance
Using apiKey: 8599b802-659e-49ef-823c-20abd8efc05c
Adding content to: doi:10.5072/FK2/TUNNVE
Using server: https://dataverse.tdl.org
Request to upload: testdir

PROCESSING(C): testdir
Found as: doi:10.5072/FK2/TUNNVE

PROCESSING(D): testdir\Capture3.JPG
Does not yet exist on server.
UPLOADED as: MD5:b2d8726f4ddba30705259143dbb283e3
CURRENT TOTAL: 1 files :9506 bytes

PROCESSING(D): testdir\Capture4.GIF
Does not yet exist on server.
UPLOADED as: MD5:3b9b536bd0abaf9c2677846f62d77ed9
CURRENT TOTAL: 2 files :23973 bytes

PROCESSING(D): testdir\Capture5.PNG
Does not yet exist on server.
UPLOADED as: MD5:ce26585c19bd1470b7229b2cfcc879f0
CURRENT TOTAL: 3 files :35448 bytes

(The same information is written into a log file.)

Optional Parameters

The full set of available command-line arguments is shown in the example below.

java -jar DVUploader-1.0.0.jar -key=<api key> -did=<dataset doi> -server=<server URL> <-listonly> <-limit=<X>> <-ex=<ext>> <-verify> <-recurse> <-maxlockwait=<X>> <dir or file names>

(Note all combinations should work, but not all have been tested together.)

-listonly: write information about what would/would not be transferred without doing any uploads. Useful as a testing/debugging option and in combination with the -verify flag, as discussed below.

-limit=<X>: limit this run to at most <X> data file uploads. Repeatedly running the uploader with, for example, -limit=5 will upload five more files at a time. This can also be useful for testing, or as a way to break uploads into chunks as part of an automated workflow.
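The chunked pattern can be sketched as a simple loop. The key, DOI, and server here are placeholders, and the commands are echoed rather than executed so the structure is visible without a live Dataverse:

```shell
#!/bin/sh
# Sketch: three capped passes, five files per pass (placeholder values).
API_KEY="00000000-0000-0000-0000-000000000000"
DATASET_DOI="doi:10.5072/FK2/TUNNVE"
SERVER_URL="https://dataverse.tdl.org"

PASSES=0
for i in 1 2 3; do
  # In a real workflow, run this command instead of echoing it.
  echo "java -jar DVUploader-1.0.0.jar -key=$API_KEY -did=$DATASET_DOI -server=$SERVER_URL -limit=5 testdir"
  PASSES=$((PASSES + 1))
done
echo "ran $PASSES passes"
```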

-ex=<ext>: exclude any file whose name matches the provided regular expression pattern, e.g. -ex=^\..* (exclude files that start with a period) or -ex=.*\.txt (exclude all files ending in .txt; note that the glob-style *.txt is not a valid regular expression). The flag can be repeated to exclude files based on multiple patterns. A common use for this flag is to avoid uploading resource files (which start with a period) on MacOS.
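Since exclusion patterns are regular expressions rather than shell globs, it can help to sanity-check a pattern locally before passing it to -ex, for example with grep -E (this assumes -ex accepts standard regex syntax; the file names below are made up):

```shell
# Build a sample file listing and try two typical exclusion patterns.
printf '%s\n' .DS_Store notes.txt data.csv > files.list

# '^\..*' matches names starting with a literal period (hidden files).
DOTFILES="$(grep -E '^\..*' files.list)"

# '.*\.txt$' matches names ending in .txt (the glob '*.txt' would not
# be a valid regular expression for this).
TXTFILES="$(grep -E '.*\.txt$' files.list)"

echo "$DOTFILES"
echo "$TXTFILES"
rm files.list
```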

-verify: use the cryptographic hash generated by Dataverse (usually MD5, but now configurable to SHA-1 and in the future to SHA-256 or SHA-512) to verify that the corresponding hash of the local file matches. This can be used to verify transfers as they occur or, combined with the -listonly flag, in a second pass to verify that all files previously uploaded match the current file system contents.

-recurse: upload files from subdirectories of the listed directory(ies). Note that since Dataverse does not support folders, your data files will be uploaded without path information into the Dataset. This could cause issues if, for example, you have files with the same name or content in different subdirectories.

-maxlockwait=<X>: the maximum time to wait (in seconds) for a Dataset lock (i.e. while the last file is ingested) to expire (default 60 seconds).

Dataverse Requirements:

DVUploader uses the native API of Dataverse and will work with v4.8.4 through v4.9.4. Until the file upload API changes, this version of DVUploader should continue to work with newer Dataverse versions. In 4.9+, DVUploader uses the new lock API to wait more robustly for the ingest lock to expire.

Frequently Asked Questions:

Can I upload a whole directory tree?

Yes, using the -recurse flag described above, though the directory structure will be flattened: since Dataverse currently doesn't support folders within datasets, the DVUploader cannot preserve nested subdirectories. By default, if you supply a directory name (e.g. test) that contains a subdirectory, e.g. test/subdir, any files in test/subdir will be ignored. However, if having all of your files appear as one flat list in a Dataset is acceptable, you can run the uploader with the -recurse flag. To upload only some subdirectories, omit -recurse and instead provide a list, e.g. java … testdir testdir/subdir1 testdir/subdir2. Note that if there are files with the same name or content in these directories, Dataverse may fail to upload them or may modify their names (e.g. file_1.txt). (The DVUploader behaves the same as if you had uploaded all of the files via the Dataverse web application.)

Java cannot find the DVUploader / Can I put the jar file in one place and not move it to upload different directories?

Yes. The examples shown use Windows-style paths. The DVUploader is a standard Java application, so as long as you give the path to the jar (and to the directories you want to upload), you can put the DVUploader jar wherever you like and run it from any directory. For example, in the command lines above you can change -jar DVUploader-1.0.0.jar to a full path such as -jar /path/to/DVUploader-1.0.0.jar on Unix. You can also add the jar to your default Java classpath to avoid having to type the path on the command line.

The DVUploader was stopped before it finished, what do I do?

The DVUploader can simply be restarted. It will scan through the existing files and start uploading at the first one that does not already exist in the Dataset.

Is the DVUploader Open Source?

Yes. The DVUploader is distributed under the Apache2 Open Source License. The source code has been posted to GitHub and is distributed as a Dataverse community product.

I see 'waiting' messages or errors from Dataverse!

Problems with the API key, server URL, or Dataset DOI should be discovered and reported early, and if a required argument is missing, the DVUploader will display usage information. Any problems occurring as files are uploaded should relate to the specifics of that file or, if you see 'Waiting' messages, to the one before it.

The DVUploader uses the Dataverse API to upload files, so any problem that could occur using the web interface can occur with the DVUploader as well: issues related to data size (upload size limits), network connections (failures or connections timing out), or Dataverse-specific rules, such as two files with the same content not being allowed. When uploading files that are further processed by Dataverse, such as zip files or spreadsheets, you may see errors such as the file already existing (e.g. if you upload an Excel file for which a .tab file has already been uploaded).

Further, when Dataverse ingests a file, it places a lock on the Dataset until the processing is done. The DVUploader attempts to wait for such a lock to be removed before uploading the next file, but it only waits 60 seconds by default. (On 4.8.x instances of Dataverse, it also cannot tell whether the Dataset is locked due to ingest or for another cause, such as being 'in review'.) If you see an error uploading a file after one where the DVUploader was 'waiting', try increasing the -maxlockwait setting. In all cases, it can be useful to try uploading any file for which the DVUploader reported an error through Dataverse's web interface.

I'd like the DVUploader to do 'X'...

Great! Tell your Dataverse administrators, who can help communicate your request to the larger community, or help develop it yourself. (The DVUploader leverages code originally developed as part of the SEAD project, and there are a number of features that have not yet been ported to Dataverse - including the ability to create a new Dataset and to upload metadata - so such functionality would not have to be built from scratch.)
