
Fix GPT2 attention scaling ignored in SDPA/FlashAttention #44397

Open
OiPunk wants to merge 3 commits into huggingface:main from OiPunk:codex/transformers-44380-gpt2-sdpa-scaling

Conversation


@OiPunk (Contributor) commented Mar 2, 2026

What does this PR do?

Fixes #44380

`GPT2Attention.forward()` did not pass the `scaling` parameter to `attention_interface`, causing the `scale_attn_weights` and `scale_attn_by_inverse_layer_idx` config options to be silently ignored when using the SDPA or FlashAttention backends.

The eager attention implementation (`eager_attention_forward`) reads these flags directly from the module and applies scaling correctly. However, SDPA and FlashAttention rely on the `scaling` parameter passed to the attention function call, which GPT2 never provided.

Changes

  1. `modeling_gpt2.py`: Compute `self.scaling` in `GPT2Attention.__init__` by combining `scale_attn_weights` (1/√d_k) and `scale_attn_by_inverse_layer_idx` (1/(layer_idx+1)), following the same pattern used by LLaMA and other models.
  2. `modeling_gpt2.py`: Pass `scaling=self.scaling` to `attention_interface()` in `forward()`.
  3. `test_modeling_gpt2.py`: Add `test_gpt2_sdpa_matches_eager_with_scaling_configs`, which verifies that SDPA and eager produce equivalent outputs when using non-default scaling configs.
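The scaling computation in change 1 can be sketched as follows. This is a minimal standalone illustration, not the exact diff; `GPT2AttentionSketch` is a hypothetical name:

```python
class GPT2AttentionSketch:
    """Minimal sketch of the combined scaling factor described above."""

    def __init__(self, head_dim, layer_idx,
                 scale_attn_weights=True, scale_attn_by_inverse_layer_idx=False):
        self.scaling = 1.0
        if scale_attn_weights:
            self.scaling = head_dim ** -0.5       # 1/sqrt(d_k)
        if scale_attn_by_inverse_layer_idx:
            self.scaling /= float(layer_idx + 1)  # 1/(layer_idx + 1)

# forward() then hands this factor to the selected backend, along the lines of:
#   attention_interface(self, query, key, value, ..., scaling=self.scaling)
```

With `head_dim=64` and `layer_idx=3`, enabling both flags gives `0.125 / 4 = 0.03125`, which every backend now receives explicitly instead of each applying (or skipping) its own default.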

Behavior table (before → after)

| Config | Eager | SDPA before fix | SDPA after fix |
| --- | --- | --- | --- |
| `scale_attn_weights=True` (default) | ÷√d_k | ÷√d_k (PyTorch default, coincidental match) | ÷√d_k ✓ |
| `scale_attn_weights=False` | No scaling | ÷√d_k (wrong) | No scaling ✓ |
| `scale_attn_by_inverse_layer_idx=True` | ÷(layer+1) | Ignored | ÷(layer+1) ✓ |
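The `scale_attn_weights=False` row can be reproduced numerically: a backend that silently falls back to its default 1/√d_k scale diverges from eager, while forwarding an explicit scale of 1.0 restores agreement. A numpy sketch with illustrative shapes (not the PR's actual test):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))  # (seq_len, head_dim=8)
k = rng.standard_normal((4, 8))

# eager with scale_attn_weights=False: raw, unscaled scores
eager = softmax(q @ k.T)
# backend silently applying its default 1/sqrt(d_k) (the pre-fix behavior)
backend_default = softmax((q @ k.T) * 8 ** -0.5)
# backend receiving the explicit scaling=1.0 (the post-fix behavior)
backend_fixed = softmax((q @ k.T) * 1.0)

assert not np.allclose(eager, backend_default)  # the silent divergence
assert np.allclose(eager, backend_fixed)
```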

OiPunk and others added 2 commits March 3, 2026 00:14
…ends

GPT2Attention.forward() did not pass the `scaling` parameter to
`attention_interface`, causing `scale_attn_weights` and
`scale_attn_by_inverse_layer_idx` config options to be silently
ignored when using SDPA or FlashAttention backends.

Compute the combined scaling factor in __init__ (following the pattern
used by LLaMA and other models) and forward it to the attention
interface so all backends produce consistent results.

Fixes huggingface#44380
…2Attention)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@vasqu (Contributor) left a comment


Some smaller comments but we got lucky on the default case tbh. Very good finding

Comment on lines +105 to +106
if self.scale_attn_weights:
self.scaling = self.head_dim**-0.5
Contributor

I think we got lucky here because SDPA and FA will default to exactly this

Comment on lines +107 to +108
if self.scale_attn_by_inverse_layer_idx:
self.scaling /= float(self.layer_idx + 1)
Contributor

This is what was silently ignored then

Contributor

Imo, we can just use self.scaling in the eager forward as well then and copy from bert or similar (meaning the eager forward). No need for extra treatment then

Contributor Author

Thanks for the review! Addressed in the latest commit (04f9ba9):

  • self.scaling is computed once in __init__ and used in both the _upcast_and_reordered_attn path (via baddbmm alpha) and the standard attention_interface path (via scaling=self.scaling).
  • The old per-forward scale factor computation in _upcast_and_reordered_attn is removed.

So no extra treatment — just self.scaling everywhere, same pattern as bert.
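The equivalence this reply relies on, folding `self.scaling` into the matmul (what `torch.baddbmm`'s `alpha` argument does) versus applying a per-forward scale factor separately, can be checked with a small numpy sketch. Shapes and config values here are illustrative, not from the PR:

```python
import numpy as np

rng = np.random.default_rng(0)
q = rng.standard_normal((2, 4, 8))  # (batch, seq_len, head_dim)
k = rng.standard_normal((2, 4, 8))
scaling = (8 ** -0.5) / float(3 + 1)  # both config flags on, layer_idx=3

# scale folded into the matmul, as baddbmm's alpha does
scores_alpha = scaling * np.einsum("bqd,bkd->bqk", q, k)
# the old per-forward approach: scale the queries first, then matmul
scores_pre = np.einsum("bqd,bkd->bqk", q * scaling, k)

assert np.allclose(scores_alpha, scores_pre)
```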

Contributor

)
result.loss.backward()

def test_gpt2_sdpa_matches_eager_with_scaling_configs(self):
Contributor

Could we also check FA

Contributor Author

Added! The FA test is at `test_gpt2_fa2_matches_eager_with_scaling_configs` — both tests use `model.set_attn_implementation()` to switch backends without reloading.

config.scale_attn_by_inverse_layer_idx = True

# Eager attention (known-correct reference)
config._attn_implementation = "eager"
Contributor

I'd rather use this after init, i.e. model.set_attn_implementation("eager")

output_eager = model_eager(input_ids, token_type_ids=token_type_ids).logits

# SDPA attention (was buggy: ignored scaling configs)
config._attn_implementation = "sdpa"
Contributor

Same here

Comment on lines +264 to +265
model_sdpa = GPT2LMHeadModel(config).to(torch_device).eval()
model_sdpa.load_state_dict(model_eager.state_dict())
Contributor

We don't need this reloading stuff; we just switch up the flags with the set_attn...

…ve tests

Per reviewer feedback:
- Replace inline scale_factor computation with self.scaling in
  _upcast_and_reordered_attn for both GPT2 and DecisionTransformer
- Use model.set_attn_implementation() instead of model reloading in tests
- Add FlashAttention2 vs eager comparison test
@OiPunk (Contributor, Author) commented Mar 3, 2026

@vasqu Thanks for the review! Addressed all feedback:

  1. `self.scaling` in `_upcast_and_reordered_attn`: Replaced the inline `scale_factor` computation with `self.scaling` in both GPT2 and DecisionTransformer. All attention paths now consistently use `self.scaling`, computed once in `__init__`.

  2. `set_attn_implementation` in tests: Switched from model reloading with `config._attn_implementation` to `model.set_attn_implementation("eager")`/`"sdpa"` — single model, flag swap.

  3. FlashAttention2 test: Added `test_gpt2_fa2_matches_eager_with_scaling_configs`, gated behind `@require_torch_accelerator` and `@require_flash_attn`.

github-actions bot commented Mar 3, 2026

[For maintainers] Suggested jobs to run (before merge)

run-slow: decision_transformer, gpt2


@vasqu (Contributor) left a comment


Thanks for iterating! Left another round of smaller comments to simplify a bit and follow some conventions.

with torch.no_grad():
output_eager = model(input_ids, token_type_ids=token_type_ids).logits

# FlashAttention2
Contributor

Suggested change:
- # FlashAttention2
+ # Flash Attention 2 (was buggy: ignored scaling configs)

with torch.no_grad():
output_fa2 = model(input_ids, token_type_ids=token_type_ids).logits

torch.testing.assert_close(output_eager, output_fa2, atol=5e-3, rtol=5e-3)
Contributor

Suggested change:
- torch.testing.assert_close(output_eager, output_fa2, atol=5e-3, rtol=5e-3)
+ torch.testing.assert_close(output_eager, output_fa2, atol=1e-2, rtol=1e-2)

Any reason for that atol/rtol? I remember FA being flaky at times, so raising it would be my personal preference.

with torch.no_grad():
output_sdpa = model(input_ids, token_type_ids=token_type_ids).logits

torch.testing.assert_close(output_eager, output_sdpa, atol=1e-5, rtol=1e-4)
Contributor

Suggested change:
- torch.testing.assert_close(output_eager, output_sdpa, atol=1e-5, rtol=1e-4)
+ torch.testing.assert_close(output_eager, output_sdpa, atol=1e-4, rtol=1e-4)

same here, just a small raise to avoid flakiness

Contributor


torch.testing.assert_close(output_eager, output_sdpa, atol=1e-5, rtol=1e-4)

@require_torch_accelerator
Contributor

Suggested change:
- @require_torch_accelerator
+ @require_torch_gpu
+ @mark.flash_attn_test



Development

Successfully merging this pull request may close these issues.

GPT2 attention scaling config is ignored when using SDPA / FlashAttention backends

2 participants