Some components of the bdbag software are configured via JSON-formatted configuration files.
There are two global configuration files: bdbag.json and keychain.json. Skeleton versions of these files with simple default values are automatically created in the current user's home directory the first time a bag is created or opened.
Additionally, three JSON-formatted configuration files can be passed as arguments to bdbag in order to supply input for certain bag creation and update functions. These files are known as metadata, ro metadata and remote-file-manifest configurations.
The file bdbag.json is a global configuration file that allows the user to specify a set of parameters to be used as
defaults when performing various bag manipulation functions.
The format of bdbag.json is a single JSON object containing a set of JSON child objects (used as
configuration sub-sections) which control various default behaviors of the software.
The top-level object is the parent object for the entire configuration. It has the following parameters:
| Parameter | Description |
|---|---|
| bdbag_config_version | The version number of the configuration file. In general, it matches the release version number of bdbag. |
| bag_config | This object contains all bag-related configuration parameters. |
| fetch_config | This object contains all fetch-related configuration parameters. |
| resolver_config | This object contains all implementation-specific resolver configuration parameters. |
| identifier_resolvers | This is a global list of identifier "meta" resolvers. It can be overridden on a per-resolver basis via the individual configuration blocks for each resolver in resolver_config. |
The bag_config object contains all bag-related configuration parameters.
| Parameter | Description |
|---|---|
| bag_algorithms | This is an array of strings representing the default checksum algorithms to use for bag manifests, if not otherwise specified. Valid values are "md5", "sha1", "sha256", and "sha512". |
| bag_archiver | This is a string representing the default archiving format to use if not otherwise specified. Valid values are "zip", "tar", and "tgz". |
| bag_metadata | This is a list of simple JSON key-value pairs that will be written as-is to bag-info.txt. |
| bag_processes | This is a numeric value representing the default number of concurrent processes to use when calculating checksums. |
| bagit_spec_version | The version of the bagit specification that created bags will conform to. Valid values are "0.97" and "1.0". |
| bag_archive_idempotent | A boolean value indicating that idempotent mode should be used by default when creating and archiving new bags. |
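As an illustration of what the bag_algorithms setting selects, the following minimal sketch computes several checksums for one file in a single pass using Python's standard hashlib module. It is not bdbag's implementation; the function name `compute_checksums` and the `chunk_size` parameter are hypothetical and exist only for this example.

```python
import hashlib

# The checksum algorithm names listed as valid for bag_algorithms.
VALID_ALGORITHMS = ("md5", "sha1", "sha256", "sha512")


def compute_checksums(path, algorithms=("md5", "sha256"), chunk_size=1024 * 1024):
    """Return {algorithm: hex digest} for the file at path (hypothetical helper)."""
    unknown = [name for name in algorithms if name not in VALID_ALGORITHMS]
    if unknown:
        raise ValueError("Unsupported checksum algorithm(s): %s" % ", ".join(unknown))
    digests = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as handle:
        # Read the file once and feed every digest from the same chunk.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            for digest in digests.values():
                digest.update(chunk)
    return {name: digest.hexdigest() for name, digest in digests.items()}
```

Reading the file once and updating every digest from the same chunk avoids re-reading large payload files when more than one algorithm is configured.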
The fetch_config object contains a set of child objects, each keyed by a transport protocol scheme and containing the configuration parameters for that scheme's transport handler.
There is a default set of transport handlers installed with bdbag. In addition, bdbag supports
externally implemented transport handlers that can be plugged in (i.e., declared as run-time imports) via the fetch_config configuration object in the bdbag.json config file.
This requires developers to perform some integration tasks.
Developers should create a class deriving from bdbag.fetch.transports.base_transport.BaseFetchTransport
and implement three required functions:
- `__init__(self, config, keychain, **kwargs)`: The class constructor. Derived classes should first call `super(<derived class name>, self).__init__(config, keychain, **kwargs)`, which sets the `config`, `keychain`, and `kwargs` variables as class member variables with the same names.
- `fetch(self, url, output_path, **kwargs)`: This method should implement the logic required to transfer a file referenced by `url` to the local path referenced by `output_path`. The `**kwargs` argument is an extensible argument dictionary that the framework may populate with extra data. For example, an integer argument `size` may be present (if it can be found in `fetch.txt` for a given fetch entry), representing the expected size of the remote file in bytes.
- `cleanup(self)`: This method should implement any transport-specific release of resources. Note that this function will be called only once per transport at the end of an entire bag fetch, and not once per file.
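The three required methods can be sketched as follows. This is a minimal illustration rather than bdbag code: the `BaseFetchTransport` class defined here is a stand-in so that the example runs without bdbag installed (a real handler would import the base class from `bdbag.fetch.transports.base_transport`), and the `LocalCopyTransport` class, its handling of `file://` URLs, and the return value of `fetch` are assumptions made for this example.

```python
import os
import shutil
from urllib.parse import urlparse
from urllib.request import url2pathname


class BaseFetchTransport(object):
    """Stand-in for bdbag.fetch.transports.base_transport.BaseFetchTransport.

    Hypothetical: it only mirrors the documented behavior of storing config,
    keychain, and kwargs as member variables with the same names.
    """

    def __init__(self, config, keychain, **kwargs):
        self.config = config
        self.keychain = keychain
        self.kwargs = kwargs


class LocalCopyTransport(BaseFetchTransport):
    """Hypothetical handler that 'fetches' file:// URLs by copying them."""

    def __init__(self, config, keychain, **kwargs):
        # Call the base constructor first, as the documentation requires.
        super(LocalCopyTransport, self).__init__(config, keychain, **kwargs)
        self.files_fetched = 0

    def fetch(self, url, output_path, **kwargs):
        source_path = url2pathname(urlparse(url).path)
        os.makedirs(os.path.dirname(os.path.abspath(output_path)), exist_ok=True)
        shutil.copyfile(source_path, output_path)
        # "size" is optional extra data that the framework may supply.
        expected_size = kwargs.get("size")
        if expected_size is not None and os.path.getsize(output_path) != int(expected_size):
            return None  # Assumption: None signals a failed transfer.
        self.files_fetched += 1
        return output_path  # Assumption: the output path signals success.

    def cleanup(self):
        # Called once per transport at the end of a bag fetch, not once per file.
        self.files_fetched = 0
```

Because `cleanup` runs once per transport, per-fetch state such as open sessions or counters can safely live on the instance between `fetch` calls.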
Configure the usage of the external transport via the fetch_config object of the bdbag.json configuration file. The fetch_config object consists of child configuration objects keyed by a lowercase string value representing the URL protocol scheme that is being configured. When configuring an external handler, the following applies:
- There is a single required top-level string parameter with the key name `handler`, which maps to the fully-qualified name of the class implementing the required methods of the `bdbag.fetch.transports.base_transport.BaseFetchTransport` base class. At runtime, the bdbag fetch framework attempts to load this class via the `importlib.import_module` machinery. If loading succeeds, the class is instantiated and the instance is cached for the duration of the bag fetch operation. Subsequently, whenever a URL is encountered in `fetch.txt` with a protocol scheme matching that of the installed handler, that handler's `fetch` method will be invoked.
- There is also an optional parameter, `allow_keychain`, which must be present and evaluate to `true` in order to propagate the bdbag `keychain` into the handler code during the `__init__` call. If the `allow_keychain` parameter is missing or set to any value that cannot be evaluated as a Python boolean `True`, then the value of the `keychain` variable passed to the `__init__` call will be `None`. In general, if the custom handler code has its own mechanism for managing credentials, then this parameter may be omitted. If the handler intends to make use of the bdbag `keychain` that is in context for the current user and fetch operation, then this parameter must be present and evaluate to `True`.
- The remainder of the protocol scheme handler configuration object can consist of any valid JSON; the entire object value assigned to the scheme key will be passed as the `config` parameter to the `__init__` method of the custom handler.
For example, given the following fetch_config section:
{
"fetch_config": {
"s3": {
"handler":"my.custom.S3Transport",
"max_read_retries": 5,
"read_chunk_size": 10485760,
"read_timeout_seconds": 120
},
"foo": {
"handler":"my.custom.FooTransport",
"allow_keychain": true,
"my_foo_complex_config": {
"bar":[
"a","b","c"
],
"baz":{
"xyz":123
}
}
}
}
}
For the scheme foo, the following object will be passed as the config parameter to the __init__ method of my.custom.FooTransport upon class instantiation:
{
"handler":"my.custom.FooTransport",
"allow_keychain": true,
"my_foo_complex_config": {
"bar":[
"a","b","c"
],
"baz":{
"xyz":123
}
}
}
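The loading behavior described above can be approximated with the following sketch. It is not the bdbag framework code: the function names `load_handler_class`, `keychain_for`, and `create_handler` are hypothetical, and the rule used here to decide whether `allow_keychain` "evaluates to true" (the boolean `True` or the string "true" in any letter case) is only one plausible interpretation.

```python
import importlib


def load_handler_class(fully_qualified_name):
    """Resolve a name such as 'my.custom.FooTransport' to a class object."""
    module_name, _, class_name = fully_qualified_name.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


def keychain_for(scheme_config, keychain):
    """Return the keychain only when allow_keychain is enabled, else None."""
    allowed = scheme_config.get("allow_keychain")
    # Assumption: accept the boolean True or the string "true".
    if allowed is True or str(allowed).lower() == "true":
        return keychain
    return None


def create_handler(scheme_config, keychain, **kwargs):
    """Instantiate the configured handler with its whole config object."""
    handler_class = load_handler_class(scheme_config["handler"])
    return handler_class(scheme_config, keychain_for(scheme_config, keychain), **kwargs)
```

Note that the handler receives the entire scheme object, including the `handler` and `allow_keychain` keys, which matches the example for the scheme foo.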
Currently, only the default http, https and s3 transport handlers have configuration objects that control their behavior.
| Parameter | Description |
|---|---|
| http | Configuration for the http fetch handler. |
| https | Configuration for the https fetch handler. |
| s3 | Configuration for the s3 fetch handler. |
The http object contains configuration parameters for the http fetch handler.
| Parameter | Description |
|---|---|
| session_config | Session configuration parameters for the requests HTTP client library. The parameters mainly control retry logic. |
| http_cookies | Configuration parameters for automatic loading and merging of HTTP cookie files. |
| allow_redirects | A boolean indicating whether redirects should be followed automatically. |
| redirect_status_codes | An array of integers representing the HTTP status codes used for determining redirection. Defaults to [301, 302, 303, 307, 308]. |
The session_config object holds session configuration parameters for the requests HTTP client library. The parameters mainly control retry logic, which is provided by the urllib3 library, wrapped by requests.
For more information, see the external urllib3 documentation.
| Parameter | Description |
|---|---|
| retry_backoff_factor | The exponential backoff factor for all retry attempts. Defaults to 1.0. |
| retry_connect | The number of connect attempts to retry. Defaults to 5. |
| retry_read | The number of read attempts to retry. Defaults to 5. |
| retry_status_forcelist | A list of HTTP response codes that will force an automatic retry. Defaults to [500, 502, 503, 504]. |
The http_cookies object holds configuration parameters for automatic loading and merging of HTTP cookie files. These cookie files must follow the Mozilla/Netscape/CURL/WGET cookie file format.
| Parameter | Description |
|---|---|
| scan_for_cookie_files | A boolean value that enables/disables the cookie scan feature globally. Defaults to true (enabled). |
| search_paths | An array of base directory paths from which to recursively search with search_paths_filter for file_names to use as input. Defaults to the system-dependent expansion of ~. |
| search_paths_filter | An fnmatch.filter pattern that can be used to filter specific subdirectories of each path specified in search_paths. Defaults to .bdbag. |
| file_names | An array of input cookie filenames or fnmatch.filter patterns to match cookie filenames against. Defaults to ["*cookies.txt"]. |
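To illustrate how the cookie scan parameters interact, here is a rough sketch using only the Python standard library. It is an approximation under stated assumptions, not bdbag's scanning code: the function name `find_cookie_files` is hypothetical, and the choice to apply search_paths_filter to each directory's base name during the recursive walk is one reading of the parameter descriptions.

```python
import fnmatch
import os


def find_cookie_files(http_cookies_config):
    """Return cookie file paths selected by an http_cookies-style dict."""
    if not http_cookies_config.get("scan_for_cookie_files", True):
        return []
    search_paths = http_cookies_config.get("search_paths", [os.path.expanduser("~")])
    dir_filter = http_cookies_config.get("search_paths_filter", ".bdbag")
    patterns = http_cookies_config.get("file_names", ["*cookies.txt"])
    found = []
    for base in search_paths:
        for dirpath, _dirnames, filenames in os.walk(base):
            # Assumption: only directories whose name matches the filter count.
            if not fnmatch.filter([os.path.basename(dirpath)], dir_filter):
                continue
            for pattern in patterns:
                for name in fnmatch.filter(filenames, pattern):
                    path = os.path.join(dirpath, name)
                    if path not in found:
                        found.append(path)
    return found
```

With the default values, this sketch would pick up a file such as `~/.bdbag/my-cookies.txt` but ignore a `cookies.txt` stored in an unrelated directory.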
The https object contains configuration parameters for the https fetch handler. The https fetch handler configuration is
identical to the http fetch handler configuration, with the following additional parameter:
| Parameter | Description |
|---|---|
| bypass_ssl_cert_verification | Either the boolean value true or false, or an array of string values consisting of URL patterns to be used in a simple substring match against the target URLs found in a bag's fetch.txt file. For example, "bypass_ssl_cert_verification": ["https://raw.githubusercontent.com/fair-research/bdbag/"] will match a fetch.txt entry with a URL of "https://raw.githubusercontent.com/fair-research/bdbag/master/test/test-data/test-http/test-fetch-http.txt". Defaults to false. |
Setting bypass_ssl_cert_verification to true is NOT RECOMMENDED, as it bypasses SSL certificate validation for
ALL HTTPS requests. The application will then accept any TLS certificate presented by a remote server and ignore hostname mismatches
and expired certificates, which makes it vulnerable to man-in-the-middle (MitM) attacks.
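The substring matching described for bypass_ssl_cert_verification can be sketched as follows. The function name `should_bypass_cert_verification` is hypothetical and the code is an illustration of the documented rule, not the handler's actual implementation.

```python
def should_bypass_cert_verification(setting, url):
    """Decide whether certificate verification is skipped for url."""
    if isinstance(setting, bool):
        # true disables verification for ALL HTTPS requests (not recommended).
        return setting
    if isinstance(setting, (list, tuple)):
        # Simple substring match of each configured pattern against the URL.
        return any(pattern in url for pattern in setting)
    # Missing or unrecognized values fall back to the documented default.
    return False
```

Listing specific URL patterns keeps the bypass limited to known hosts or paths, whereas the boolean form applies to every HTTPS request.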
The s3 object contains configuration parameters for the s3 fetch handler.
| Parameter | Description |
|---|---|
| max_read_retries | Maximum number of socket read retries. Defaults to 5. |
| read_chunk_size | Number of bytes to consume per read attempt. Defaults to 10485760 bytes (10MB). |
| read_timeout_seconds | Timeout in seconds per read attempt. Defaults to 120. |
The resolver_config object contains all implementation-specific resolver configuration parameters, keyed by resolver scheme. The current default handler schemes are ark, minid, doi, and ga4ghdos.
Each scheme can have multiple resolver configuration blocks in an array, where each block can be mapped to a different resolver namespace prefix.
| Parameter | Description |
|---|---|
| handler | This is the fully-qualified Python class name of a class derived from bdbag.fetch.resolvers.base_resolver.BaseResolverHandler and implementing the required functions. The bdbag resolver code will attempt to locate and instantiate this class at runtime. |
| prefix | This is an optional parameter that maps the handler resolution to only instances that contain the specific prefix found in the identifier. |
| identifier_resolvers | This is the same parameter as the global identifier_resolvers array. If found at this level, it will override the global setting for this scheme/prefix combination. |
Below is a sample bdbag.json file:
{
"bag_config": {
"bag_algorithms": [
"md5",
"sha256"
],
"bag_metadata": {
"BagIt-Profile-Identifier": "https://raw.githubusercontent.com/fair-research/bdbag/master/profiles/bdbag-profile.json",
"Contact-Name": "mdarcy",
"Contact-Orcid": "0000-0003-2280-917X"
},
"bag_processes": 1,
"bagit_spec_version": "0.97"
},
"bdbag_config_version": "1.5.0",
"fetch_config": {
"http": {
"session_config": {
"retry_backoff_factor": 1.0,
"retry_connect": 5,
"retry_read": 5,
"retry_status_forcelist": [
500,
502,
503,
504
]
},
"http_cookies": {
"file_names": [
"*cookies.txt"
],
"scan_for_cookie_files": true,
"search_paths": [
"/home/mdarcy"
],
"search_paths_filter": ".bdbag"
}
},
"s3": {
"max_read_retries": 5,
"read_chunk_size": 10485760,
"read_timeout_seconds": 120
}
},
"identifier_resolvers": [
"n2t.net",
"identifiers.org"
],
"resolver_config": {
"ark": [
{
"identifier_resolvers": [
"n2t.net",
"identifiers.org"
],
"prefix": null
},
{
"handler": "bdbag.fetch.resolvers.ark_resolver.MinidResolverHandler",
"identifier_resolvers": [
"n2t.net",
"identifiers.org"
],
"prefix": "57799"
},
{
"handler": "bdbag.fetch.resolvers.ark_resolver.MinidResolverHandler",
"identifier_resolvers": [
"n2t.net",
"identifiers.org"
],
"prefix": "99999/fk4"
}
],
"doi": [
{
"handler": "bdbag.fetch.resolvers.doi_resolver.DOIResolverHandler",
"identifier_resolvers": [
"n2t.net",
"identifiers.org"
],
"prefix": "10.23725/"
}
],
"ga4ghdos": [
{
"handler": "bdbag.fetch.resolvers.dataguid_resolver.DataGUIDResolverHandler",
"identifier_resolvers": [
"n2t.net"
],
"prefix": "dg.4503/"
}
],
"minid": [
{
"handler": "bdbag.fetch.resolvers.ark_resolver.MinidResolverHandler",
"identifier_resolvers": [
"n2t.net",
"identifiers.org"
]
}
]
}
}
The file keychain.json is used to specify the authentication mechanisms and credentials for the various URLs that might
be encountered while trying to resolve (download) the files listed in a bag's fetch.txt file.
The format of keychain.json is a JSON array containing a list of JSON objects, each of which specifies a set of parameters used to
configure the authentication method and credentials to use for a specified base URL.
| Parameter | Description |
|---|---|
| uri | This is the base URI used to specify when authentication should be used. When a URI reference is encountered in fetch.txt, an attempt will be made to match it against all base URIs specified in keychain.json and if a match is found, the request will be authenticated before file retrieval is attempted. |
| auth_uri | This is the authentication URI used to establish an authenticated session for the specified uri. This is currently assumed to be an HTTP(s) protocol URL. |
| auth_type | This is the authentication type used by the server specified by uri or auth_uri (if present). |
| auth_params | This is a child object containing authentication-type specific parameters used in session establishment. It will generally contain credential information such as a username and password, a cookie value, or client certificate parameters. It can also contain other parameters required for authentication with the given auth_type mechanism; for example the HTTP method (i.e., GET or POST) to use with HTTP Basic Auth. |
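The matching of fetch.txt URLs against keychain entries can be pictured with the sketch below. It is illustrative only: the function names are hypothetical, and treating a "match" as a case-sensitive prefix comparison against the entry's uri is an assumption, since the exact matching rule is not spelled out here.

```python
import json


def parse_keychain(keychain_json_text):
    """Parse keychain.json content, which must be a JSON array of objects."""
    keychain = json.loads(keychain_json_text)
    if not isinstance(keychain, list):
        raise ValueError("keychain.json must contain a JSON array")
    return keychain


def find_keychain_entries(keychain, url):
    """Return every entry whose base uri is a prefix of url (assumed rule)."""
    matches = []
    for entry in keychain:
        base_uri = entry.get("uri")
        if base_uri and url.startswith(base_uri):
            matches.append(entry)
    return matches
```

Under this reading, an entry with the uri "https://some.host.com/somefiles/" would apply to "https://some.host.com/somefiles/data.csv" but not to a URL on a different host.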
Below is a sample keychain.json file:
[
{
"uri": "https://some.host.com/somefiles/",
"auth_uri": "https://some.host.com/authenticate",
"auth_type": "http-form",
"auth_params": {
"username": "me",
"password": "mypassword",
"username_field": "username",
"password_field": "password"
}
},
{
"uri": "https://some.host.com/somefiles/",
"auth_uri": "https://some.host.com/authenticate",
"auth_type": "http-basic",
"auth_params": {
"auth_method":"POST",
"username": "me",
"password": "mypassword"
}
},
{
"uri": "https://some.host.com/somefiles/",
"auth_type": "cookie",
"auth_params": {
"cookies": [ "a_cookie_name=zxyfw1231_secret"]
}
},
{
"uri": "https://some.host.com/somefiles/",
"auth_type": "bearer-token",
"auth_params": {
"token": "<token>",
"allow_redirects_with_token": true
}
},
{
"uri": "ftp://some.host.com/somefiles/",
"auth_type": "ftp-basic",
"auth_params": {
"username": "anonymous",
"password": "bdbag@users.noreply.github.com"
}
},
{
"uri": "s3://mybucket",
"auth_type": "aws-credentials",
"auth_params": {
"key": "foo",
"secret": "bar"
}
},
{
"uri": "gs://gcs-bdbag-integration-testing/",
"auth_type": "gcs-credentials",
"auth_params": {
"project_id": "bdbag-204999",
"allow_requester_pays": true
}
},
{
"uri": "gs://bdbag-dev/",
"auth_type": "gcs-credentials",
"auth_params": {
"project_id": "bdbag-204999",
"service_account_credentials_file": "/home/bdbag/bdbag-204400-41babdd46e24.json"
}
},
{
"uri": "globus://my_endpoint/my_files/",
"auth_type": "globus_transfer",
"auth_params": {
"local_endpoint": "b06c5a10-0b17-11e7-a73f-22000bf2d559",
"transfer_token": "AQBXNMizAAAAAAADPIg9SoyPk_dm0BOFcWT7pe-52fQKv2Je6zi-hEvJ5xkfXw8rLaL9mVg8RtOY-vy4qrQd"
}
}
]
A remote-file-manifest configuration file is used by bdbag during bag creation and update as a way
to include files in a bag that are not necessarily present on the local system, and therefore cannot be hashed.
The file is processed by bdbag and the data is used to generate both payload manifest entries and fetch.txt
entries in the resulting bag.
The remote-file-manifest is structured as a JSON array containing a list of JSON objects that have the following attributes:
- `url`: The URL where the file can be located or dereferenced from. This value MUST be present.
- `length`: The length of the file in bytes. This value MUST be present.
- `filename`: The filename (or path), relative to the bag 'data' directory as it will be referenced in the bag manifest(s) and fetch.txt files. This value MUST be present.
- AT LEAST one (and ONLY one of each) of the following `algorithm:checksum` key-value pairs:
  - `md5:<md5 hex value>`
  - `sha1:<sha1 hex value>`
  - `sha256:<sha256 hex value>`
  - `sha512:<sha512 hex value>`
- Other legal JSON keys and values of arbitrary complexity MAY be included, as long as the basic requirements of the structure (as described above) are fulfilled.
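The structural requirements listed above can be checked with a short sketch like the following. The function name `validate_remote_file_manifest` and the wording of the error messages are hypothetical, and bdbag's own validation may differ.

```python
REQUIRED_KEYS = ("url", "length", "filename")
CHECKSUM_KEYS = ("md5", "sha1", "sha256", "sha512")


def validate_remote_file_manifest(entries):
    """Return a list of problems found in a parsed remote-file-manifest."""
    if not isinstance(entries, list):
        return ["manifest must be a JSON array"]
    errors = []
    for index, entry in enumerate(entries):
        for key in REQUIRED_KEYS:
            if key not in entry:
                errors.append("entry %d: missing required key '%s'" % (index, key))
        # At least one algorithm:checksum pair must be present. Extra keys are allowed.
        if not any(key in entry for key in CHECKSUM_KEYS):
            errors.append("entry %d: no checksum key present" % index)
    return errors
```

An empty list means that every entry carries the three mandatory attributes and at least one checksum. Once the manifest has been parsed into Python dictionaries, each checksum key can occur at most once per entry.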
Below is a sample remote-file-manifest configuration file:
[
{
"url":"https://raw.githubusercontent.com/fair-research/bdbag/master/profiles/bdbag-profile.json",
"length":699,
"filename":"bdbag-profile.json",
"sha256":"eb42cbc9682e953a03fe83c5297093d95eec045e814517a4e891437b9b993139"
},
{
"url":"ark:/88120/r8059v",
"length": 632860,
"filename": "minid_v0.1_Nov_2015.pdf",
"sha256": "cacc1abf711425d3c554277a5989df269cefaa906d27f1aaa72205d30224ed5f"
}
]
A bag-info metadata configuration file consists of a single JSON object containing a set of JSON key-value pairs that will be
written as-is to the bag's bag-info.txt file. NOTE: per the bagit specification, strings are the only supported value type in bag-info.txt.
Below is a sample bag-info metadata configuration file:
{
"BagIt-Profile-Identifier": "https://raw.githubusercontent.com/fair-research/bdbag/master/profiles/bdbag-profile.json",
"External-Description": "Simple bdbag test",
"Arbitrary-Metadata-Field": "This is completely arbitrary"
}
A Research Object metadata configuration file consists of a single JSON object containing a set of JSON key-object pairs where
the key is a / delimited relative file path and the object is any arbitrarily complex JSON content. This format allows
bdbag to process all RO metadata as an aggregation which can then be serialized into individual JSON file components relative
to the bag's metadata directory.
NOTE: while this documentation refers to this configuration file as a ro metadata file,
the contents of this configuration file only have to conform to the bagit-ro
conventions if bagit-ro compatibility is the goal. Otherwise, this mechanism can be used as a generic way to create any number of
arbitrary JSON (or JSON-LD) metadata files as bagit tagfiles.
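The serialization of such an aggregation into individual files can be sketched as follows. This is an illustration of the idea rather than bdbag's implementation: the function name `write_ro_metadata`, the `metadata_dir` argument, and the output formatting (UTF-8, four-space indentation) are assumptions.

```python
import json
import os


def write_ro_metadata(ro_metadata, metadata_dir):
    """Write each key of ro_metadata as a JSON file under metadata_dir."""
    written = []
    for relative_path, content in ro_metadata.items():
        # Keys are "/"-delimited relative paths, e.g. "annotations/x.jsonld".
        target = os.path.join(metadata_dir, *relative_path.split("/"))
        os.makedirs(os.path.dirname(target), exist_ok=True)
        with open(target, "w", encoding="utf-8") as handle:
            json.dump(content, handle, indent=4)
        written.append(target)
    return written
```

Each top-level key becomes one file, so a configuration with the keys "manifest.json" and "annotations/CTD_chem_gene_ixn_types.csv.jsonld" would produce two files, the second inside an `annotations` subdirectory.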
Below is a sample ro metadata configuration file:
{
"manifest.json": {
"@context": [ "https://w3id.org/bundle/context" ],
"@id": "../",
"createdOn": "2018-02-08T12:23:00Z",
"aggregates": [
{ "uri": "../data/CTD_chem_gene_ixn_types.csv",
"mediatype": "text/csv"
},
{ "uri": "../data/CTD_chemicals.csv",
"mediatype": "text/csv"
},
{ "uri": "../data/CTD_pathways.csv",
"mediatype": "text/csv"
}
],
"annotations": [
{ "about": "../data/CTD_chem_gene_ixn_types.csv",
"content": "annotations/CTD_chem_gene_ixn_types.csv.jsonld"
}
]
},
"annotations/CTD_chem_gene_ixn_types.csv.jsonld": {
"@context": {
"schema": "http://schema.org/",
"object": "schema:object",
"TypeName": {
"@type": "schema:name",
"@id": "schema:name"
},
"Code": {
"@type": "schema:code",
"@id": "schema:code"
},
"Description": {
"@type": "schema:description",
"@id": "schema:description"
},
"ParentCode": {
"@type": "schema:code",
"@id": "schema:parentItem"
},
"results": {
"@id": "schema:object",
"@type": "schema:object",
"@container": "@set"
}
}
}
}