
Fixing flash attn in ddp diffusion model #2002

Draft
Jubeku wants to merge 2 commits into mk/mh/diffusion-single-sample from jk/mk/mh/diffusion-single-sample-fix-ddp

Conversation


@Jubeku Jubeku commented Mar 6, 2026

Description

When running multi-GPU training with fe_diffusion_model: True, we get an illegal memory access error during backpropagation in the varlen attention.

In this PR we tried the following fixes; however, the issue persists:

  • adding .contiguous() to all q/k/v tensors before they enter flash attention,
  • using the training model's frozen encoder directly, so no separate encoder copy is initialized,
  • detaching the encoded target tokens before passing them to the target aux output.
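As a minimal sketch of the first fix: varlen flash-attention kernels require contiguous inputs, and views produced by transposes or slicing are often non-contiguous. The helper name `ensure_contiguous` below is hypothetical (not from this PR's diff); the pattern it shows, forcing contiguity just before the kernel call, is the idea:

```python
import torch

def ensure_contiguous(*tensors):
    """Return contiguous versions of the given tensors.

    Tensors that are already contiguous are passed through unchanged;
    non-contiguous views (e.g. from transpose/permute) are copied.
    """
    return tuple(t if t.is_contiguous() else t.contiguous() for t in tensors)

# A transposed view is non-contiguous until fixed.
q = torch.randn(8, 4, 16).transpose(0, 1)  # view with shape (4, 8, 16)
k = torch.randn(4, 8, 16)
v = torch.randn(4, 8, 16)
assert not q.is_contiguous()

q, k, v = ensure_contiguous(q, k, v)
assert all(t.is_contiguous() for t in (q, k, v))
# q, k, v would then be passed to flash_attn_varlen_func(...) (GPU-only).
```

The passthrough for already-contiguous tensors avoids redundant copies on the hot path.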

Outside of this PR, we implemented Torch's SDPA as an alternative to the varlen attention. This did fix the error, but it is not what we want because of the memory overhead (padded batches instead of packed sequences). Still, if the problem persists, we might use it as an interim solution to run diffusion experiments in a multi-GPU setting.
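For context, a sketch of the SDPA fallback mentioned above (shapes and mask construction are illustrative assumptions, not the PR's actual code): padding every sequence to the batch maximum and masking the padding keys, which is where the memory overhead relative to varlen packing comes from:

```python
import torch
import torch.nn.functional as F

# Padded batch: (batch, heads, max_len, head_dim) instead of varlen packing.
B, H, L, D = 2, 4, 8, 16
q = torch.randn(B, H, L, D)
k = torch.randn(B, H, L, D)
v = torch.randn(B, H, L, D)

# True sequence lengths; positions >= length are padding.
lengths = torch.tensor([8, 5])

# Boolean mask, True = attend; broadcastable to (B, H, L, L).
key_valid = torch.arange(L)[None, :] < lengths[:, None]   # (B, L)
attn_mask = key_valid[:, None, None, :].expand(B, 1, L, L)

out = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
assert out.shape == (B, H, L, D)
```

The padded form costs O(B * L_max^2) attention memory even for short sequences, whereas varlen packing only pays for the actual token counts; that is the overhead that makes this an interim solution rather than the fix.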

Issue Number

Fixes #1999

Is this PR a draft? Mark it as draft.

Checklist before asking for review

  • I have performed a self-review of my code
  • My changes comply with basic sanity checks:
    • I have fixed formatting issues with ./scripts/actions.sh lint
    • I have run unit tests with ./scripts/actions.sh unit-test
    • I have documented my code and I have updated the docstrings.
    • I have added unit tests, if relevant
  • I have tried my changes with data and code:
    • I have run the integration tests with ./scripts/actions.sh integration-test
    • (bigger changes) I have run a full training and I have written in the comment the run_id(s): launch-slurm.py --time 60
    • (bigger changes and experiments) I have shared a HedgeDoc in the GitHub issue with all the configurations and runs for these experiments
  • I have informed and aligned with people impacted by my change:
    • for config changes: the MatterMost channels and/or a design doc
    • for changes of dependencies: the MatterMost software development channel

@github-actions github-actions bot added the model Related to model training or definition (not generic infra) label Mar 9, 2026
