Common Workflow Language (CWL) is used to define heterogeneous computational workflows in a way that ensures reproducibility. Heterogeneous workflows can include steps implemented in different programming languages (we demonstrate Python and R) and/or steps performed by calling Unix/Linux command line utilities.
CWL implementations ensure that if a workflow runs on your laptop, it will also run in any other environment, such as an HPC Slurm cluster, and produce exactly the same results (maybe a bit faster).
This sample workflow demonstrates how to:
- Download files using the standard Unix/Linux command line utility `curl`.
  - The goal of this step is to demonstrate a reusable CWL command line tool.
- Manipulate files (e.g. unzip).
  - The goal of this step is to demonstrate using ad hoc command line utilities.
- Process files using a Python package.
  - This step demonstrates how to implement processing in Python.
- Process files using an arbitrary tool, using an R script as an example.
  - This step demonstrates how one can use R or, in fact, any scripting utility, such as `sed` or `awk`.
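To make the first step in the list above more concrete, a reusable CWL command line tool wrapping `curl` might look like the following sketch. It is not the tool shipped in the repository: the file name `download.cwl`, the input names `url` and `output_name`, and the output name `downloaded` are all hypothetical and chosen only for illustration.

```yaml
# download.cwl (hypothetical file name; not taken from the repository)
cwlVersion: v1.2
class: CommandLineTool
baseCommand: [curl, --location]    # --location makes curl follow HTTP redirects
requirements:
  NetworkAccess:
    networkAccess: true            # declare that the tool needs outbound network access
inputs:
  output_name:                     # hypothetical input: name of the local file to write
    type: string
    inputBinding:
      prefix: --output             # curl writes the response body to this file
      position: 1
  url:                             # hypothetical input: address of the file to download
    type: string
    inputBinding:
      position: 2
outputs:
  downloaded:                      # hypothetical output name
    type: File
    outputBinding:
      glob: $(inputs.output_name)  # collect the file that curl just wrote
```

Because the URL and the output name are inputs rather than hard-coded values, the same tool description could be reused by any workflow step that needs to fetch a file.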
This workflow does not demonstrate an important CWL feature: scattering. This feature allows running multiple processes in parallel.
For details, see the CWL User Guide and the NSAPH gridMET processing workflow, which scatters processing over parameters (bands) and over years.
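For orientation, scattering could be sketched roughly as below. This is an assumption-laden illustration, not an excerpt from the NSAPH workflow: the input `years`, the step name `process_year`, and the output `reports` are hypothetical, and a plain `echo` stands in for real processing.

```yaml
cwlVersion: v1.2
class: Workflow
requirements:
  ScatterFeatureRequirement: {}    # required before any step may use "scatter"
inputs:
  years: int[]                     # hypothetical input: one job is created per element
outputs:
  reports:
    type: File[]                   # scattering turns the step's File output into an array
    outputSource: process_year/report
steps:
  process_year:                    # hypothetical step name
    scatter: year                  # run the step once for every value of "year"
    in:
      year: years
    out: [report]
    run:                           # inline placeholder tool; real processing would go here
      class: CommandLineTool
      baseCommand: echo
      inputs:
        year:
          type: int
          inputBinding:
            position: 1
      outputs:
        report:
          type: stdout             # capture standard output as the result file
      stdout: $(inputs.year).txt   # one output file per scattered job
```

Whether the scattered jobs actually run concurrently depends on the runner. With the reference runner, the `--parallel` flag used in the run command in this document allows independent jobs to execute at the same time.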
The workflow has been tested in a Conda environment, using Anaconda 4.10.3 for Mac. While a conda environment file is provided for your reference, I would recommend setting up the environment manually. This is because certain dependencies, especially geospatial libraries, should be installed in a particular order.
Please follow these steps:
- Clone the repository
- Create a new conda environment
- Install Python 3.8 and R
- Install geopandas
- Install Python dependencies
- Create an empty directory and cd there
- Look at the workflow graph (result)
- Run the workflow
- Compare the results with expected results
Setup Commands:
git clone https://github.com/fasrc/epa_cwl_airflow.git
cd epa_cwl_airflow
conda create --name fasrccwl
conda activate fasrccwl
conda install python=3.8 R
conda install geopandas
pip install -r requirements.txt
Generate workflow graph:
export sourceroot=$(pwd)
cwl-runner --print-dot $sourceroot/workflow/cwl/workflow.cwl | dot -Tgif > workflow_graph.gif
Run the workflow:
export sourceroot=$(pwd)
export workdir=$(mktemp -d)  # assumption: any new, empty directory works here
cd "$workdir"
cwl-runner --parallel $sourceroot/workflow/cwl/workflow.cwl
ls -alF
cat *
diff -r . $sourceroot/results