#1108860 linux-image-6.1.0-34-amd64: Wireguard fragmentation fails with VXLAN since kernel 6.1.0-34, causing network timeouts

Package:
src:linux
Source:
src:linux
Submitter:
Charles Bordet
Date:
2026-05-05 18:01:02 UTC
Severity:
normal
Tags:
#1108860#5
Date:
2025-07-06 12:57:41 UTC
From:
To:
Dear Maintainer,

What led up to the situation?
We run a production environment using Debian 12 VMs, with a network
topology involving VXLAN tunnels encapsulated inside Wireguard
interfaces. This setup has worked reliably for over a year, with MTU set
to 1500 on all interfaces except the Wireguard interface (set to 1420).
Wireguard kernel fragmentation allowed this configuration to function
without issues, even though the effective path MTU is lower than 1500.

What exactly did you do (or not do) that was effective (or ineffective)?
We performed a routine system upgrade, updating all packages including the
kernel. After the upgrade, we observed severe network issues (timeouts,
very slow HTTP/HTTPS, and apt update failures) on all VMs behind the
router. SSH and small-packet traffic continued to work.

To diagnose, we:

* Restored a backup (with the previous kernel): the problem disappeared.
* Repeated the upgrade, confirming the issue reappeared.
* Systematically tested each kernel version from 6.1.124-1 up to
6.1.140-1. The problem first appears with kernel 6.1.135-1; all earlier
versions work as expected.
* The kernel from backports (6.12.32-1) did not resolve the
problem.

What was the outcome of this action?

* With kernel 6.1.135-1 or later, network timeouts occur for
large-packet protocols (HTTP, apt, etc.), while SSH and small-packet
protocols work.
* With kernel 6.1.133-1 or earlier, everything works as expected.

What outcome did you expect instead?
We expected the network to function as before, with Wireguard handling
fragmentation transparently and no application-level timeouts,
regardless of the kernel version.

#1108860#10
Date:
2025-07-06 13:33:55 UTC
From:
To:
Hi Charles,

Thanks for the report and narrowing down the version where the issue
is introduced on Debian side.

Since you seem to reliably reproduce the issue, would it be possible
for you to bisect the changes between upstream 6.1.133 and 6.1.135 so
that we can find the offending commit and make a report upstream?

Additionally, could you also try the kernels 6.12.35-1 from unstable
and 6.15.4-1~exp1 from experimental to determine whether the issue is
still present there?

Regards,
Salvatore

#1108860#23
Date:
2025-07-06 17:43:35 UTC
From:
To:
Hi,

Thank you for the quick reply.

We tried kernel versions 6.12.35-1 from unstable and 6.15.4-1 from experimental and the issue still appears on both versions.

We are currently bisecting the changes to identify the commit. This will take several days as the server is used in production and we need to minimize downtime during working hours. I will get back to this issue as soon as the commit is identified.

Thank you,
Charles

#1108860#32
Date:
2025-07-06 18:59:46 UTC
From:
To:
Hi Charles,

Ack, thanks for confirming that, I just updated the bug metadata to
reflect that.

Yes, that is fully understandable. It would be ideal if this could be
reproduced under lab conditions, but otherwise it simply takes the
time it needs.

Ping us back once you have identified the breaking commit.

Thanks for your debugging.

Regards,
Salvatore

#1108860#37
Date:
2025-07-10 18:56:26 UTC
From:
To:
control: tags -1 + moreinfo

Hi Charles,
In our weekly kernel-team meeting we talked about your issue, and Ben
pointed out that he had recently seen a PMTU-related change.

And in fact there is 8930424777e4 ("tunnels: Accept PACKET_HOST in
skb_tunnel_check_pmtu().") which is from 6.15-rc1. It was backported
to various stable series; of interest for your report is that it was
backported to 6.1.134, which falls exactly in the range where you
noticed the breakage.

Thus: are you able to test 6.1.y first, with a revert of the given
commit on top, and see if that fixes your issue?

Regards,
Salvatore

#1108860#42
Date:
2025-07-13 06:41:39 UTC
From:
To:
Hi Salvatore,

Thank you for your guidance and for pointing out the relevant commit.

I have tested, checking out from tag 6.1.134:
- With the revert of commit b88786ea2c8f ("tunnels: Accept PACKET_HOST in skb_tunnel_check_pmtu()"): the issue does not appear and everything works as expected
- With the commit included (no revert), the issue reappears exactly as before.

This confirms that the regression is directly linked to this commit.

Is there anything else I can do or provide to help with the resolution?
Thanks,
- Charles

#1108860#49
Date:
2025-07-14 19:33:03 UTC
From:
To:
Hi Charles,

Thanks a lot that is great news, so we have isolated the regression
commit already. I will try to assemble a regression report upstream
soon (after checking if it is already known, hopefully not missing a
report) and keep you in the loop.

Regards,
Salvatore

#1108860#56
Date:
2025-07-14 19:57:52 UTC
From:
To:
Hi,

Charles Bordet reported the following issue (full context in
https://bugs.debian.org/1108860)

While triaging the issue we found that the commit 8930424777e4
("tunnels: Accept PACKET_HOST in skb_tunnel_check_pmtu().") introduces
the issue, and Charles confirmed that the issue is present as well in
6.12.35 and 6.15.4 (later versions could potentially still be
affected, but we wanted to check that it is not a 6.1.y-specific
regression).

Reverting the commit fixes Charles' issue.

Does that ring a bell?

Regards,
Salvatore

#1108860#63
Date:
2025-07-15 09:43:30 UTC
From:
To:
It doesn't ring a bell. Do you have more details on the setup that has
the problem? Or, ideally, a self-contained reproducer?

#1108860#68
Date:
2025-07-15 12:55:05 UTC
From:
To:
Hi Charles,

Btw, I reported this upstream (you are in CC), but you will need to
provide details on the setup and/or a standalone reproducer there. So
any help you can provide would be great.

Regards,
Salvatore

#1108860#73
Date:
2025-07-16 05:51:40 UTC
From:
To:
Hi,

Saw that. I will try to provide a reproducible setup with an Ansible playbook or something. Give me a few more days for that.

Thanks,
- Charles

#1108860#78
Date:
2025-07-16 07:08:58 UTC
From:
To:
Hi Charles,

That is perfect!

Regards,
Salvatore

#1108860#83
Date:
2025-07-16 12:44:55 UTC
From:
To:
Guillaume Nault <gnault@redhat.com> writes:

+1 - I tested this patch with an OVS setup using vxlan and geneve
tunnels.  A reproducer or more details would help.

#1108860#90
Date:
2025-08-30 19:03:01 UTC
From:
To:
Hi,

Charles, any news here? Did you find a way to provide a
self-contained reproducer for your issue?

Does the issue still reproduce for you on the most current version of
each of the affected stable series?

Regards,
Salvatore

#1108860#95
Date:
2026-05-01 09:29:29 UTC
From:
To:
Hi Salvatore, Guillaume, Aaron,

Apologies for the very long silence, I was unable to provide the reproducible example, and the server being in production made it difficult to allocate time for further debugging.

I'm writing back with an update and more details.

The issue still reproduces on the latest kernel. I upgraded today to kernel 6.1.0-44-amd64 (6.1.140-1) and the regression is still present: large TCP transfers through the VXLAN-over-Wireguard tunnel time out, while small-packet traffic (SSH, DNS, ping) works fine.
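
For anyone trying to check this symptom in a lab, a small probe along the following lines may help. It is only a sketch, not the test used in this report: the helper names `serve_bytes` and `download_completes` are made up for illustration, and it assumes a plain TCP sender on the far side of the tunnel and a client inside a VM.

```python
import socket


def serve_bytes(listener, nbytes):
    """Accept one connection on `listener`, send `nbytes` zero bytes, close."""
    conn, _ = listener.accept()
    with conn:
        conn.sendall(b"\0" * nbytes)


def download_completes(host, port, nbytes, timeout=5.0):
    """Return True if exactly `nbytes` are received from host:port.

    Returns False if the connection fails, if any single socket operation
    exceeds `timeout` seconds, or if the peer closes early.
    """
    received = 0
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            while received < nbytes:
                chunk = sock.recv(65536)
                if not chunk:  # peer closed before sending everything
                    break
                received += len(chunk)
    except OSError:  # covers timeouts and connection errors
        return False
    return received == nbytes
```

Run with a payload of a few hundred bytes and again with a few megabytes; on an affected kernel one would expect only the large transfer to return False, if the symptom described above applies.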

More details about the setup:
Two physical hosts (router1 and pve1) connected via a Wireguard tunnel (wg-hosts, MTU 1420). A VXLAN tunnel (VNI 11, dstport 4789) runs over the Wireguard interface, and is attached to a Linux bridge (vmbr0, MTU 1500). VMs run on pve1 and are bridged to vmbr0 with MTU 1500. router1 is the default gateway (10.0.0.1) and performs SNAT for internet access.
The effective path MTU for VM traffic through the bridge -> VXLAN -> Wireguard path is 1370 bytes (1420 WG - 50 VXLAN/UDP/Ethernet overhead). Before the regression, Wireguard handled fragmentation transparently for packets exceeding this limit.
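
As a rough illustration of that arithmetic, the sketch below breaks the 50-byte overhead into its parts. It assumes the VXLAN underlay inside Wireguard is IPv4 with no IP options and no VLAN tag on the inner frame; the constant and function names are illustrative only.

```python
# Per-packet overhead added by VXLAN encapsulation, in bytes.
# Assumption: IPv4 underlay, no IP options, untagged inner Ethernet frame.
OUTER_IPV4 = 20
OUTER_UDP = 8
VXLAN_HEADER = 8
INNER_ETHERNET = 14


def vxlan_inner_mtu(underlay_mtu: int) -> int:
    """Largest inner IP packet that fits in one underlay packet."""
    overhead = OUTER_IPV4 + OUTER_UDP + VXLAN_HEADER + INNER_ETHERNET
    return underlay_mtu - overhead


WG_MTU = 1420      # MTU of wg-hosts
BRIDGE_MTU = 1500  # MTU of vmbr0 and the VM interfaces

inner = vxlan_inner_mtu(WG_MTU)
print(inner)               # 1370
print(BRIDGE_MTU - inner)  # 130 bytes that a full-size VM packet exceeds the path by
```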

From a VM, PMTU discovery works. For example, `ip route show cache` correctly shows `mtu 1370` for all destinations. However, large TCP downloads (over 1 MB) stall and eventually time out.

I found a workaround by adding TCP MSS clamping on the forwarding path of router1:
```
table inet mangle {
    chain forward {
        type filter hook forward priority mangle; policy accept;
        tcp flags syn / syn,rst tcp option maxseg size set rt mtu
    }
}
```
With this rule, all TCP SYN/SYN-ACK packets forwarded through router1 have their MSS rewritten to match the PMTU, and large downloads work without any issue. Removing the rule brings back the timeouts.
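
To make the effect of the clamp concrete, here is a sketch of the MSS values one would expect. It assumes the route MTU seen by the rule is the 1370-byte path MTU reported above and that no IP or TCP options are counted; it is plain arithmetic, not taken from the nftables implementation.

```python
IPV4_HEADER = 20
IPV6_HEADER = 40
TCP_HEADER = 20


def clamped_mss(path_mtu: int, ipv6: bool = False) -> int:
    """TCP payload per segment that fits in path_mtu without fragmentation."""
    ip_header = IPV6_HEADER if ipv6 else IPV4_HEADER
    return path_mtu - ip_header - TCP_HEADER


print(clamped_mss(1500))             # 1460, what a VM with MTU 1500 advertises
print(clamped_mss(1370))             # 1330, after clamping to the path MTU
print(clamped_mss(1370, ipv6=True))  # 1310
```

With the smaller MSS, full-size TCP segments already fit the 1370-byte path, so they no longer depend on fragmentation or PMTU handling along the tunnel.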

Thank you,
- Charles