Run data managers aggressive parallelization and refactoring. by rhpvorderman · Pull Request #79 · galaxyproject/ephemeris

rhpvorderman · 2018-03-12T09:35:35Z

TODO: Rebase on master once "Also update data managers" (#78) is merged. I built this on top of that branch because it already had some improvements and I wanted to avoid merge conflicts.

While installing a few reference genomes on my Galaxy instance, I got annoyed by the indexing steps. These take quite a long time, and run-data-managers only runs one data manager at a time. I feel that job scheduling should be handled by Galaxy and not by run-data-managers, so I changed the way run-data-managers submits jobs.

Now run-data-managers first picks all the data managers that populate source tables (default: ["all_fasta"]) and runs them, since other data managers depend on these tables. After that it submits all the other data managers at once and lets Galaxy figure out how to schedule these jobs.
This provides a significant speedup when you're adding a vertebrate genome to the list. Instead of being created one after another, your bowtie and bwa indexes are now created simultaneously.
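
As a rough illustration of the ordering described above, here is a minimal sketch of how entries could be split into "source table" data managers and everything else. This is not the code from this PR: the `tables` key, the function name, and the example ids are hypothetical, and only the `all_fasta` default comes from the description.

```python
# Default named in the PR description; everything else here is illustrative.
DEFAULT_SOURCE_TABLES = ["all_fasta"]


def split_by_source_tables(data_managers, source_tables=DEFAULT_SOURCE_TABLES):
    """Split entries into (fetchers, others).

    An entry counts as a fetcher when any table listed under its
    hypothetical 'tables' key is one of the source tables.
    """
    sources = set(source_tables)
    fetchers, others = [], []
    for dm in data_managers:
        if sources.intersection(dm.get("tables", [])):
            fetchers.append(dm)
        else:
            others.append(dm)
    return fetchers, others


# Made-up configuration entries, for illustration only.
config = [
    {"id": "fetch_genome", "tables": ["all_fasta"]},
    {"id": "bowtie2_index", "tables": ["bowtie2_indexes"]},
    {"id": "bwa_index", "tables": ["bwa_mem_indexes"]},
]
fetchers, others = split_by_source_tables(config)
print([dm["id"] for dm in fetchers])
print([dm["id"] for dm in others])
```

The first group has to finish before the second is submitted, because the indexers read the entries that the fetchers add to the source tables.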

Internally I had to completely overhaul run-data-managers. It is now a DataManagers object that has a run method. This made a lot of communication between functions much easier, and the code is a bit cleaner. The DataManagers object can now also be used in other scripts.
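
To make the "object with a run method" idea concrete, here is a small sketch of a runner that submits a whole batch of jobs before waiting on any of them. It is a hypothetical stand-in and not the DataManagers class from this PR: the class name, the constructor arguments, and the injected `submit`/`wait` callables are assumptions chosen so the example runs without a Galaxy server.

```python
class TwoPhaseRunner:
    """Hypothetical stand-in, not the ephemeris DataManagers class."""

    def __init__(self, fetch_jobs, index_jobs, submit, wait):
        self.fetch_jobs = fetch_jobs
        self.index_jobs = index_jobs
        self._submit = submit  # callable: job -> handle, returns immediately
        self._wait = wait      # callable: handle -> final state, blocks

    def _run_phase(self, jobs):
        # Submit the whole batch first so the server can schedule the jobs
        # side by side, then block until each one has finished.
        handles = [self._submit(job) for job in jobs]
        return [self._wait(handle) for handle in handles]

    def run(self):
        # Phase 1 must complete before phase 2 is submitted.
        fetch_states = self._run_phase(self.fetch_jobs)
        index_states = self._run_phase(self.index_jobs)
        return fetch_states, index_states


# Fake server that only records the order of calls.
events = []


def fake_submit(job):
    events.append(("submit", job))
    return job


def fake_wait(handle):
    events.append(("wait", handle))
    return "ok"


runner = TwoPhaseRunner(
    ["fetch_genome"], ["bowtie2_index", "bwa_index"], fake_submit, fake_wait
)
states = runner.run()
print(events)
```

Keeping the state on one object means another script can build the runner, call `run()`, and inspect the results without going through a command-line entry point.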

Since I had to do some testing, I overhauled the test scripts as well. These are now split into 3 parts. The shed-tools testing was quite slow, and I did not want to wait on it all the time. There is now a separate script for testing run-data-managers, which made testing a bit easier.

jmchilton

Looks good to me, thanks!

bgruening · 2018-03-20T08:54:49Z

Sorry for being so late to the game. This is great, thanks a lot @rhpvorderman!

galaxyproject deleted a comment Mar 12, 2018

rhpvorderman added 16 commits March 13, 2018 09:45

start refactoring run-data-managers

28678fa

further refactoring

6f91eb5

separation work

0aa7070

added run method

9472f21

completely refactored run_data_managers

eb5d207

fix style issues

95fce5a

fixed buc

a8345f9

update documentation

188210e

flake8 issue

44a08d6

separated test scripts

7298246

fix overwrite logic

6704244

remove redundant installations

23d0142

change source tables to all_fasta only

75eafde

updated test scripts

3e089f7

remove redundant docker rm

2f8c3ca

fix codacy issue

4f6e389

rhpvorderman force-pushed the paralleldatamanagers branch from 3106e2e to 4f6e389 Compare March 13, 2018 08:46

galaxyproject deleted a comment Mar 13, 2018

jmchilton approved these changes Mar 19, 2018


rhpvorderman merged commit e68574b into galaxyproject:master Mar 20, 2018

rhpvorderman deleted the paralleldatamanagers branch March 20, 2018 08:40

rhpvorderman merged 16 commits into galaxyproject:master from rhpvorderman:paralleldatamanagers
