Run data managers aggressive parallelization and refactoring. #79

Merged
rhpvorderman merged 16 commits into galaxyproject:master from rhpvorderman:paralleldatamanagers
Mar 20, 2018

Conversation

@rhpvorderman
Contributor

@rhpvorderman rhpvorderman commented Mar 12, 2018

  • TODO: Rebase on master once "Also update data managers" (#78) is merged. I built on top of that branch because it had some improvements and I wanted to avoid merge conflicts.

While installing a few reference genomes on my Galaxy instance, I got annoyed by the indexing steps. They take quite a long time, and run-data-managers only runs one data manager at a time. I feel that job scheduling should be handled by Galaxy and not by run-data-managers, so I changed the way run-data-managers submits jobs.

Now run-data-managers first picks all the data managers that populate source tables (default: ["all_fasta"]), since other data managers depend on these tables, and runs them. After that it runs all the other data managers and lets Galaxy figure out how to schedule all these jobs.
This provides a significant speedup when you're adding a vertebrate genome to the list. Instead of watching your bowtie and bwa indexes be created one after another, they are now created simultaneously.
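
The ordering described above can be sketched roughly as follows. This is only an illustration of the idea, not the code in this PR: `DataManagerEntry`, its `data_tables` field, and `split_by_source_tables` are hypothetical names, and the only value taken from the description is the default source table list `["all_fasta"]`.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Sequence, Tuple


@dataclass
class DataManagerEntry:
    """Hypothetical record: a data manager id plus the data tables it populates."""
    id: str
    data_tables: List[str] = field(default_factory=list)


def split_by_source_tables(
    entries: Iterable[DataManagerEntry],
    source_tables: Sequence[str] = ("all_fasta",),
) -> Tuple[List[DataManagerEntry], List[DataManagerEntry]]:
    """Split entries into (source fetchers, everything else).

    An entry counts as a source fetcher if it populates at least one of
    the source tables; all other entries are assumed to depend on them.
    """
    wanted = set(source_tables)
    sources: List[DataManagerEntry] = []
    others: List[DataManagerEntry] = []
    for entry in entries:
        if wanted.intersection(entry.data_tables):
            sources.append(entry)
        else:
            others.append(entry)
    return sources, others


if __name__ == "__main__":
    # Illustrative entries only; the ids and table names are made up.
    entries = [
        DataManagerEntry("bowtie2_index", ["bowtie2_indexes"]),
        DataManagerEntry("fetch_genome", ["all_fasta"]),
        DataManagerEntry("bwa_index", ["bwa_mem_indexes"]),
    ]
    sources, others = split_by_source_tables(entries)
    print("first batch:", [e.id for e in sources])
    print("second batch:", [e.id for e in others])
```

The first batch has to finish before the second is submitted, because the indexers read the tables that the fetchers fill. Within each batch, nothing needs to be ordered on the client side.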

Internally I had to completely overhaul run-data-managers. It is now a DataManagers object that has a run method. This made a lot of the communication between functions much easier. The code is also a bit cleaner now, and the DataManagers object can be used in other scripts.
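
To give a feel for the shape of such an object, here is a minimal sketch. It is not the actual DataManagers class from this PR: the class name `DataManagersSketch`, its constructor arguments, and the `submit` and `wait_all` callables are all assumptions made for illustration, standing in for whatever talks to Galaxy.

```python
from typing import Any, Callable, List, Sequence


class DataManagersSketch:
    """Illustrative stand-in for an object with a run method.

    `submit` is assumed to start one job and return a handle immediately;
    `wait_all` is assumed to block until every given handle has finished.
    """

    def __init__(
        self,
        source_jobs: Sequence[Any],
        other_jobs: Sequence[Any],
        submit: Callable[[Any], Any],
        wait_all: Callable[[List[Any]], None],
    ) -> None:
        self.source_jobs = list(source_jobs)
        self.other_jobs = list(other_jobs)
        self.submit = submit
        self.wait_all = wait_all

    def run(self) -> None:
        # Phase 1: source table fetchers. Phase 2: everything else.
        for batch in (self.source_jobs, self.other_jobs):
            # Submit the whole batch without waiting in between, so the
            # server side is free to schedule the jobs in parallel.
            handles = [self.submit(job) for job in batch]
            # Only then block until the whole batch is done.
            self.wait_all(handles)


if __name__ == "__main__":
    # Fake callables that just print, to show the order of operations.
    def fake_submit(job):
        print("submit", job)
        return job

    def fake_wait_all(handles):
        print("wait for", handles)

    DataManagersSketch(
        ["fetch_genome"], ["bowtie2_index", "bwa_index"], fake_submit, fake_wait_all
    ).run()
```

Keeping the shared state on one object means the two phases do not have to pass connection details and job lists back and forth between free functions, and another script can construct the object and call `run` itself.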

Since I had to do some testing, I overhauled the test scripts as well. These are now split into 3 parts. The shed-tools testing was quite slow, and I did not want to wait on it all the time. There is now a separate script for testing run-data-managers, which made testing a bit easier.

@galaxyproject galaxyproject deleted a comment Mar 12, 2018
@galaxyproject galaxyproject deleted a comment Mar 12, 2018
@galaxyproject galaxyproject deleted a comment Mar 12, 2018
@galaxyproject galaxyproject deleted a comment Mar 12, 2018
@rhpvorderman rhpvorderman force-pushed the paralleldatamanagers branch from 3106e2e to 4f6e389 Compare March 13, 2018 08:46
@galaxyproject galaxyproject deleted a comment Mar 13, 2018
Member

@jmchilton jmchilton left a comment

Looks good to me, thanks!

@rhpvorderman rhpvorderman merged commit e68574b into galaxyproject:master Mar 20, 2018
@rhpvorderman rhpvorderman deleted the paralleldatamanagers branch March 20, 2018 08:40
@bgruening
Member

Sorry for being so late to the game. This is great, thanks a lot @rhpvorderman!
