Hello,
Full disclosure: I used Claude for a lot of this work, including the text
below. That said, I was thorough in the testing, careful to verify the bug
and the fix, and have read through the details below -- but I don't want to
pretend there weren't robots involved.
Summary
-------
OpenVPN's DCO (data-channel offload) userspace code in 2.6.x can desync its
client table from the kernel under a peer-deletion storm: it leaks client
instances until the server reaches --max-clients and refuses new/reconnecting
clients, recoverable only by restarting openvpn. This is fixed upstream in 2.7
by commit 7791f535 ("dco: process messages immediately after read", fixes
OpenVPN issue #919), but that fix is not in the 2.6 branch, and trixie ships
2.6.14.
This is the userspace counterpart to the kernel-module crash I reported in
#1140548 (openvpn-dco-dkms). A DCO server needs both fixes to be safe under
load: #1140548 stops the kernel crash; this one stops the userspace
client-table desync. (I confirmed the two are independent -- with the
openvpn-dco-dkms fix applied but stock openvpn, the desync still occurs; with
this openvpn fix applied but the stock buggy module, the kernel still
crashes.)
To be clear, this is a separate bug from #1140548 in a different source
package (openvpn vs openvpn-dco-dkms) -- not a duplicate; both fixes are
required and neither substitutes for the other.
The bug
-------
On a busy DCO server, when many peers are deleted in a short window (e.g. a
network blip that makes a large number of peers hit --keepalive timeout at
nearly the same time), libnl delivers multiple netlink messages per
nl_recvmsgs(), but the 2.6 DCO read path stores each notification into
single-slot dco_context_t fields and processes only the last, silently
dropping the rest. openvpn then never reaps those client instances ->
n_clients is never decremented -> the instances leak. Once enough leak, new
and reconnecting clients are rejected with:
    MULTI: new incoming connection would exceed maximum number of clients
and the only recovery is restarting openvpn. Both the management interface and
the status file report the inflated client count (they read the same
multi_context list).
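To illustrate the failure mode, here is a toy model in Python. It is not
OpenVPN code: the function names, the 12-peer table and the fixed batch size
of 3 notifications per receive call are all made up for illustration. It only
contrasts "store each notification in one slot, handle the slot after the
read" with "handle each notification as it is parsed".

```python
def batches(items, size):
    """Split notifications into groups, one group per simulated receive call."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def reap_single_slot(clients, recv_batches):
    """Model of the 2.6-style path: one slot, processed once per receive call."""
    remaining = set(clients)
    for batch in recv_batches:
        slot = None
        for peer_id in batch:
            slot = peer_id          # each message overwrites the previous one
        if slot is not None:
            remaining.discard(slot)  # only the last message is acted on
    return remaining


def reap_immediately(clients, recv_batches):
    """Model of the 7791f535 idea: process every message right after parsing."""
    remaining = set(clients)
    for batch in recv_batches:
        for peer_id in batch:
            remaining.discard(peer_id)
    return remaining


peers = list(range(12))      # hypothetical peer IDs
storm = batches(peers, 3)    # assumption: 3 DEL_PEER messages per receive call

stale_single_slot = reap_single_slot(peers, storm)
stale_immediate = reap_immediately(peers, storm)

print("stale with single slot:", len(stale_single_slot))
print("stale with immediate processing:", len(stale_immediate))
```

In this model the single-slot path leaves 8 of 12 instances stale and the
immediate path leaves none. On a real server the batch sizes vary from read to
read, which is consistent with the nondeterministic stale counts I saw.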
Reproduced, and the fix tested
------------------------------
Reproduced against the stock Debian package (2.6.14 + openvpn-dco-dkms): 1024
DCO clients, server under heavy CPU load (stress-ng), then drop the server's
inbound VPN packets so all peers time out near-simultaneously. The kernel
deletes all 1024 peers but openvpn is left with a nondeterministic number of
stale instances (one run left 767/1024), and reconnects are then refused.
I built a patched package (2.6.14 + the attached patch) and re-ran the same
load: the desync is gone -- repeated 1024-peer deletion storms under stress-ng
drain cleanly to 0 with no stale instances, reconnect to 1024 with zero
max-clients rejections, no regressions, and it survives reboot.
Patch
-----
Attached is a quilt patch (DEP-3 header). I generated it with quilt against the
current openvpn source package, so it slots in at the end of the existing
debian/patches series, and the resulting package builds and passed all the load
testing above -- i.e. it is ready to drop into debian/patches/ and add to the
series as-is. Please note it is an *adaptation* of upstream 7791f535, not a
clean cherry-pick: 7791f535 depends on two earlier 2.7 commits that are not in
release/2.6 (a699681b, which introduces dco->c, and 7f5a6dea, which introduces
c->multi), and the intervening 2.7 DCO/new-module rework means neither those
nor 7791f535 apply to 2.6.14 (they conflict in 5 and 7 files respectively). So
the patch reimplements 7791f535's logic against the old ovpn-dco API and
inlines the two prerequisites as small c/multi back-pointers in dco_context_t.
It is therefore worth a real review (back-pointer lifetime, the server/client
dispatch, the lock placement) rather than a verbatim diff.
The option I'd like to offer
----------------------------
To be explicit: the attached patch can be applied as-is now -- you do not need
to wait for upstream to ship anything. It's a forwarded delta (see below), it
slots into the existing series, and the built package passed the load testing
above. This would fit the same trixie stable-update path as the
openvpn-dco-dkms fix in #1140548.
Also raised upstream
--------------------
I have also posted this to the openvpn-devel mailing list, asking whether they
will backport 7791f535 to release/2.6 and in what form (verbatim prerequisites
vs. a minimal adaptation like mine):
https://sourceforge.net/p/openvpn/mailman/openvpn-devel/thread/5afdb852-eabf-4829-b95f-6a322ed5d56a%40midjourney.com/#msg59351167
That thread is for coordination, not a blocker. Upstream may well decline a
release/2.6 backport altogether -- 2.6 + the out-of-tree ovpn-dco module is the
older path, and their answer may be "move to 2.7 + the in-tree module." If so,
carrying this delta is the practical way to fix 2.6.x users regardless of what
upstream decides; and if they do land an official release/2.6 commit, it can
simply replace this delta later. (If you carry it, the DEP-3 Forwarded: field
should point at that mailing-list post once it has an archive URL -- happy to
provide that.)
Full validation details (logs, etc.) available on request.
Thanks,
Thomas Nyberg