Skip to content

Add 128x128 configurations#405

Open
penelopeysm wants to merge 2 commits into
mainfrom
py/128conf
Open

Add 128x128 configurations#405
penelopeysm wants to merge 2 commits into
mainfrom
py/128conf

Conversation

@penelopeysm

@penelopeysm penelopeysm commented Jun 4, 2026

Copy link
Copy Markdown
Member

Quite self-explanatory really

Closes #401.

# @package _global_
defaults:
- ae/advection_diffusion/ae_dc_large
- /distributed: ddp_4gpu_2node_slurm

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since this can scale to arbitrary numbers of nodes quite easily maybe we should rename it?


model:
encoder:
periodic: true

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in an ideal world the periodicity should really be tied to the dataset and inferred from the dataset -- it would be nice to not have to remember to specify this here. I think the code can be fairly easily extended to do this, right?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We haven't run this but this was one of the things we were considering looking at

# @package _global_
defaults:
- epd/advection_diffusion/crps_vit_azula_large
- /distributed: ddp_4gpu_slurm

@penelopeysm penelopeysm Jun 4, 2026

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line really shouldn't be here (likewise for the other 3 configs in this folder). There are actually two autocast bugs here to do with the resolution of distributed, which I ran into on Isambard:

  • In principle omitting this line entirely should just work because the local experiment inherits from another local experiment. However, the parent experiment's /distributed isn't picked up correctly. I haven't looked specifically into why this is so but I'm fairly sure that it's a bug somewhere in scripts/workflow/slurm.py.

  • In principle if the parent experiment already defines /distributed, the child should specify override /distributed. However, this isn't picked up in autocast. I think this should just be patched in this line

val = item.get("/distributed") or item.get("distributed")

I can definitely confirm that these are bugs because when I ran these exact configs on Isambard with either no /distributed or override /distributed it would crash quite quickly as the job would only be assigned 1 GPU.

I'm fairly sure that the Hydra side of things is working perfectly correct -- it's only this custom code where we detect distributed and then use it to change the SLURM-specific things that's problematic.

@penelopeysm penelopeysm Jun 4, 2026

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opened #406. I think the main question for this PR is do we want to fix that bug first before merging this (I think we could conceivably test on/off isambard with --dry-run) because otherwise these configs are technically wrong

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add 128x128 configs

1 participant