This repository was archived by the owner on Nov 14, 2024. It is now read-only.

Syncing_Directories

Alexis Lucattini edited this page Jul 21, 2023 · 10 revisions

Overview

This section covers:

  1. Using gds-sync-download
    • For syncing a gds folder with a local path
  2. Using gds-sync-upload
    • For syncing a local path with a gds folder
  3. Using gds-create-download-script
    • Generates a bash script containing the presigned URLs of all files recursively under a directory.
    • The bash script can be copied to any server to download the gds files.
  4. Using gds-migrate
    • Copy data from one project context to another through a manifest list and TES task.
  5. Using gds-migrate-v2
    • Copy data from one project context to another through a manifest list and TES task to a v2 project.
  6. Using gds-migrate-to-aws
    • Copy data from gds to your aws account

gds-sync-download

auto-completion: ✅

Sync a gds folder with a local directory using the temporary aws creds in a given gds folder. This function requires admin privileges in the source project.

Options:

  • --gds-path: Path to the gds folder
  • --download-path: Path to your local directory
  • --write-script-path: Path to output file containing a bash script to run the command

Requirements:

  • curl
  • jq
  • python3
  • aws

Environment vars:

  • ICA_BASE_URL
  • ICA_ACCESS_TOKEN
    • You will need to first run ica-context-switcher to get this variable into your environment

Extra info:

  • You can also use any of the aws s3 sync parameters to add to the command list, for example:

    gds-sync-download --gds-path gds://volume-name/path-to-folder/ --exclude='*' --include='*.fastq.gz'
    

    will download only fastq files from that folder.

    • If you are unsure which files will be downloaded, use the --dryrun parameter. This will inform you of which files will be downloaded to your local file system.

    • Unlike rsync, trailing slashes on the --gds-path and --download-path do not matter. One can assume that a trailing slash exists on both parameters. This means that the contents of the --gds-path folder are downloaded into the --download-path directory.

    • Despite this being a 'download' command, you will need an 'admin' token to run it.

      • aws s3 sync requires the PutObject policy on the s3 side, regardless of the direction of the sync.
  • Use --write-script-path if gds-sync-download is not installed on the machine where you wish to execute the command.

    • A use case may be as follows:
      • gds-sync-download is installed on your local computer but the location you wish to run the command is on an ec2 instance.
      • You may run gds-sync-download with the --write-script-path set to run-download.sh.
      • You may then upload the run-download.sh script to the ec2 instance and launch the script from there.
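The --write-script-path workflow above can be sketched roughly as follows. This is a minimal illustration only: the script name run-download.sh and the placeholder command are assumptions, whereas the real tool writes the fully resolved aws s3 sync command (with temporary credentials) into the script.

```shell
#!/usr/bin/env bash
# Sketch of the --write-script-path idea: instead of running the sync
# directly, save the command into a script that can be copied to
# another host (e.g. an EC2 instance) and executed there.
set -euo pipefail

script="run-download.sh"   # value you would pass to --write-script-path

# Placeholder body; the real tool emits the resolved aws s3 sync command.
cat > "${script}" <<'EOF'
#!/usr/bin/env bash
echo "would run: aws s3 sync <presigned-source> <download-path>"
EOF
chmod +x "${script}"

# On the target host, after copying the script across:
./"${script}"
```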


gds-sync-upload

auto-completion: ✅

Sync a local directory with a gds folder using the temporary aws creds in a given gds folder.
This function requires admin privileges in the destination project.

Options:

  • --src-path: Path to your local directory
  • --gds-path: Path to the gds folder
  • --write-script-path: Path to output file containing a bash script to run the command

Requirements:

  • curl
  • jq
  • python3
  • aws

Environment vars:

  • ICA_BASE_URL
  • ICA_ACCESS_TOKEN
    • You will need to first run ica-context-switcher to get this variable into your environment

Extras:

See extras in gds-sync-download

gds-create-download-script

auto-completion: ✅

Create a script at <output_prefix>.sh that downloads all files under the gds path.

Options:

  • --gds-path: Path to the gds folder (Required)
  • --output-prefix: Output file prefix (Required)

Requirements:

  • jq
  • python3

Environment vars:

  • ICA_BASE_URL
  • ICA_ACCESS_TOKEN
    • You will need to first run ica-context-switcher to get this variable into your environment

Extras:
Extras:
The output file generated by this command is a bash script that uses base64 encoding to store the following information for every file:

  • The presigned url (which expires in one week)
  • The output path of the file (relative to the output folder)
  • The e-tag
  • The file size

When files are downloaded through wget, the e-tag and file size are calculated for each local file and compared with the values recorded in the script for that file.

No environment variables are needed to run the output script (just jq and python3 binaries).

To prove this, in the example GIF below, the output script is first copied from the user's local directory to a fresh ec2-instance and executed on the ec2-instance (that does not have ica or ica-ica-lazy installed).
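The per-file verification step can be sketched roughly as follows. The file name, contents, and the simple md5-style e-tag comparison are illustrative assumptions; the actual generated script may differ in detail.

```shell
#!/usr/bin/env bash
# Sketch: after downloading a file, compare its size and md5-based
# e-tag against the values recorded at script-generation time.
set -euo pipefail

printf 'hello gds\n' > sample.bin            # stand-in for a downloaded file

expected_size=10                             # recorded file size (bytes)
expected_etag="$(printf 'hello gds\n' | md5sum | cut -d' ' -f1)"  # recorded e-tag

actual_size="$(wc -c < sample.bin)"
actual_etag="$(md5sum sample.bin | cut -d' ' -f1)"

if [ "${actual_size}" -eq "${expected_size}" ] && [ "${actual_etag}" = "${expected_etag}" ]; then
  echo "sample.bin: OK"
else
  echo "sample.bin: FAILED verification" >&2
  exit 1
fi
```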


gds-migrate

auto-completion: ✅

Copy data from one project to another project.

You can also use this command to copy data within a project.

Options:

  • --src-path: path to gds source directory
  • --src-project: name of the source project
  • --dest-path: path to gds dest directory
  • --dest-project: name of the destination project
  • --rsync-args: List of rsync args
  • --stream: Stream the input files rather than download into the task

Requirements:

  • jq
  • python3

Environment vars:

  • ICA_BASE_URL

Extras:

  • You will need at least read-only permissions in the source project and have registered an access token with ica-add-access-token.
  • You will need at least admin permissions in the destination project and have registered an access token with ica-add-access-token.
  • rsync args are comma separated and should be passed as a single string, e.g. --rsync-args "--include=*/,--include=*.fastq.gz,--exclude=*".
    • Quote the entire value; there is no need to add extra quotes or backslashes around the asterisks.
  • --stream is most efficient when the output directory is expected to be much smaller than the input directory.
  • Within a project, you may choose to use this command over ica folders copy, as gds-migrate has the added benefit of selecting certain files with the --rsync-args parameter.
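The comma-separated --rsync-args convention above can be illustrated with a small sketch of how such a value might be split back into individual rsync arguments. The splitting logic shown is an assumption for illustration, not the tool's actual implementation.

```shell
#!/usr/bin/env bash
# Sketch: split a single comma-separated --rsync-args value into
# individual rsync arguments.
set -euo pipefail

rsync_args_str='--include=*/,--include=*.fastq.gz,--exclude=*'

# Read the comma-separated string into a bash array, one arg per element.
IFS=',' read -r -a rsync_args <<< "${rsync_args_str}"

# Show the resulting arguments, one per line.
printf '%s\n' "${rsync_args[@]}"
```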


gds-migrate-v2

auto-completion: ✅

Copy data from a project on v1 to a project on v2.

Options:

  • --src-path: path to gds source directory
  • --src-project: name of the source project
  • --dest-path: path to gds dest directory
  • --dest-project: name of the destination project
  • --rsync-args: List of rsync args
  • --stream: Stream the input files rather than download into the task

Requirements:

  • jq
  • python3

Environment vars:

  • ICA_BASE_URL
  • ICAV2_BASE_URL (defaults to ica.illumina.com)
  • ICAV2_ACCESS_TOKEN

Extras:

  • You will need at least read-only permissions in the source project and have registered an access token with ica-add-access-token.
  • You can use the following command to extract your access token from your .icav2 directory:
    export ICAV2_ACCESS_TOKEN="$(yq eval '.access-token' ~/.icav2/.session.ica.yaml)"
    
  • You must have write access to the v2 destination project.
  • rsync args are comma separated and should be passed as a single string, e.g. --rsync-args "--include=*/,--include=*.fastq.gz,--exclude=*".
    • Quote the entire value; there is no need to add extra quotes or backslashes around the asterisks.
  • --stream is most efficient when the output directory is expected to be much smaller than the input directory.
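If yq is not available, the access-token line can often be pulled out of the session file with plain shell tools. This is a hedged fallback sketch: the file name and key follow the yq example on this page, but the real session file may have a more complex structure that requires a proper YAML parser.

```shell
#!/usr/bin/env bash
# Sketch: extract the access token from a flat YAML session file
# without yq (the real file lives at ~/.icav2/.session.ica.yaml).
set -euo pipefail

session_file="session.ica.yaml"   # stand-in for ~/.icav2/.session.ica.yaml
printf 'access-token: example-token-value\n' > "${session_file}"

# Pull out the value of the access-token key.
ICAV2_ACCESS_TOKEN="$(sed -n 's/^access-token: *//p' "${session_file}")"
export ICAV2_ACCESS_TOKEN
echo "${ICAV2_ACCESS_TOKEN}"
```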


gds-migrate-to-aws

auto-completion: ✅

Copy data from gds to your aws account.

Options:

  • --gds-path: path to gds source directory
  • --s3-path: path to the s3 dest directory
  • --stream: Use stream mode for inputs (download is the default). Any additional arguments are passed to aws s3 sync.

Requirements:

  • aws
  • aws-sso-creds
  • jq (v1.5+)
  • python3 (v3.4+)

Environment:

  • ICA_BASE_URL
  • ICA_ACCESS_TOKEN
  • AWS_PROFILE
  • AWS_REGION

Extras:

  • You can also use any of the aws s3 sync parameters to add to the command list, for example:

    gds-migrate-to-aws --gds-path gds://volume-name/path-to-folder/ --s3-path s3://temp-bucket/folder/ --exclude='*' --include='*.fastq.gz'
    

    will copy only fastq files from that gds folder.

  • Unlike rsync, trailing slashes on the --gds-path and --s3-path do not matter. One can assume that a trailing slash exists on both parameters. This means that the contents of the --gds-path folder are copied into the --s3-path folder.

  • You should use the --stream option if the output will be relatively small compared to the input.

  • aws-sso-creds can be downloaded from the releases page at https://github.com/jaxxstorm/aws-sso-creds

  • A token with at least read-only scope must be registered for the source path.
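The effect of combining --exclude='*' with --include='*.fastq.gz' can be sketched locally: everything is excluded first, then fastq.gz files are re-included. The file names below are illustrative, and the case-statement glob matching is a simplified stand-in for the filter logic aws s3 sync applies.

```shell
#!/usr/bin/env bash
# Sketch: which files survive --exclude='*' --include='*.fastq.gz'?
set -euo pipefail

selected=()
for f in sample1.fastq.gz sample2.fastq.gz report.html log.txt; do
  case "${f}" in
    *.fastq.gz) selected+=("${f}") ;;   # matched by --include='*.fastq.gz'
    *)          ;;                      # dropped by --exclude='*'
  esac
done

# Only the fastq files remain.
printf '%s\n' "${selected[@]}"
```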

Next Steps

Head to the Workflow Handling page for some lazy scripts on handling workflows with ica.
