Add 128x128 configurations #405
# @package _global_
defaults:
  - ae/advection_diffusion/ae_dc_large
  - /distributed: ddp_4gpu_2node_slurm
Since this can scale to an arbitrary number of nodes quite easily, maybe we should rename it?
model:
  encoder:
    periodic: true
I think in an ideal world the periodicity should really be tied to the dataset and inferred from the dataset -- it would be nice to not have to remember to specify this here. I think the code can be fairly easily extended to do this, right?
We haven't run this, but it was one of the things we were considering looking at.
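To make the suggestion above concrete, here is a minimal sketch of how `encoder.periodic` could be filled in from dataset metadata instead of being set by hand in each experiment config. Everything here is an assumption for illustration: `DatasetMeta`, its `periodic` field, and `apply_dataset_periodicity` are hypothetical names, not existing code in this repository, and the config is treated as a plain nested dict.

```python
import copy
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class DatasetMeta:
    """Hypothetical metadata record; the real dataset API may look different."""

    name: str
    periodic: bool


def apply_dataset_periodicity(
    model_cfg: Dict[str, Any], meta: DatasetMeta
) -> Dict[str, Any]:
    """Return a copy of model_cfg with encoder.periodic defaulted from the dataset.

    An explicit value already present in the config is left untouched.
    """
    cfg = copy.deepcopy(model_cfg)
    encoder = cfg.setdefault("encoder", {})
    # Only fill the value when the experiment config did not set it explicitly.
    encoder.setdefault("periodic", meta.periodic)
    return cfg


meta = DatasetMeta(name="advection_diffusion", periodic=True)
inferred = apply_dataset_periodicity({}, meta)
explicit = apply_dataset_periodicity({"encoder": {"periodic": False}}, meta)
```

Letting an explicit config value win over the inferred one is a design choice in this sketch; the opposite policy (always trusting the dataset) would be equally easy to implement.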
# @package _global_
defaults:
  - epd/advection_diffusion/crps_vit_azula_large
  - /distributed: ddp_4gpu_slurm
This line really shouldn't be here (likewise for the other 3 configs in this folder). There are actually two autocast bugs here to do with the resolution of distributed, which I ran into on Isambard:
- In principle, omitting this line entirely should just work because the local experiment inherits from another local experiment. However, the parent experiment's /distributed isn't picked up correctly. I haven't looked specifically into why this is so, but I'm fairly sure that it's a bug somewhere in scripts/workflow/slurm.py.
- In principle, if the parent experiment already defines /distributed, the child should specify override /distributed. However, this isn't picked up in autocast. I think this should just be patched in this line
I can definitely confirm that these are bugs: when I ran these exact configs on Isambard with either no /distributed or override /distributed, the job crashed quite quickly because it was only assigned 1 GPU.
I'm fairly sure that the Hydra side of things is working correctly -- it's only this custom code, where we detect distributed and then use it to change the SLURM-specific things, that's problematic.
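For the second bug, a rough sketch of what a more tolerant lookup might look like is below. It scans a Hydra-style defaults list and accepts `/distributed`, `distributed`, and `override /distributed` as the same group. The function name `find_distributed` is hypothetical, and I have not checked how scripts/workflow/slurm.py actually reads the defaults, so treat this as an illustration of the idea rather than a patch.

```python
from typing import Any, List, Optional


def find_distributed(defaults: List[Any]) -> Optional[str]:
    """Return the option selected for the `distributed` group, or None.

    Hypothetical helper: accepts keys such as "/distributed",
    "distributed" and "override /distributed". Later entries win.
    """
    selected: Optional[str] = None
    for entry in defaults:
        if not isinstance(entry, dict):
            # Plain string entries (e.g. a parent experiment) select no group here.
            continue
        for key, value in entry.items():
            # Drop leading Hydra keywords such as "override" or "optional",
            # then any "@package" suffix and the leading slash.
            group = key.split()[-1].split("@")[0].lstrip("/")
            if group == "distributed":
                selected = value
    return selected


plain = find_distributed(
    ["epd/advection_diffusion/crps_vit_azula_large", {"/distributed": "ddp_4gpu_slurm"}]
)
overridden = find_distributed(
    [
        "epd/advection_diffusion/crps_vit_azula_large",
        {"override /distributed": "ddp_4gpu_slurm"},
    ]
)
missing = find_distributed(["epd/advection_diffusion/crps_vit_azula_large"])
```

This only covers the `override` case. The first bug (a child with no /distributed entry at all) would additionally need the parent experiment's defaults to be loaded and searched when the lookup returns `None`.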
I opened #406. I think the main question for this PR is whether we want to fix that bug before merging (I think we could conceivably test on/off Isambard with --dry-run), because otherwise these configs are technically wrong.
Quite self-explanatory really
Closes #401.