#1095028 xvfb: Race condition in xvfb-run

Package:
xvfb
Source:
xvfb
Description:
Virtual Framebuffer 'fake' X server
Submitter:
Ole Streicher
Date:
2025-08-31 12:33:05 UTC
Severity:
normal
Tags:
#1095028#5
Date:
2025-02-02 19:05:41 UTC
From:
To:
Dear maintainer,

the build of the "giza" package currently fails in some environments due
to an issue with xvfb-run [1], which is used from debian/rules.

The problem is that the xvfb-run script only checks that Xvfb is running
(by signalling with signal 0), but not whether it is actually active
(and accepts X client calls):

     wait || :
     if kill -0 $XVFBPID 2>/dev/null; then
         break

The test code in giza looks like this:

   XW.display = XOpenDisplay (NULL);
   XW.screennum = DefaultScreen (XW.display);

In some environments, the client command sometimes seems to start before
the server is ready to accept X connections; XOpenDisplay() then returns
NULL, which in the case of giza leads to a segmentation fault in the
following line.

Having the display variable checked for NULL before using it would
certainly avoid the segfault, but still the code could not run properly
because of the missing X connection.

I think that xvfb-run needs to check that Xvfb is properly initialized,
instead of just sending a signal to the server, before running the command.
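
One possible shape for such a check, as a minimal sketch only: retry a real
X client as a probe until it manages to connect. The helper name
wait_for_display, its arguments and the PROBE_INTERVAL variable are
hypothetical and not part of xvfb-run.

```shell
# Hypothetical helper, not part of xvfb-run.
# Usage: wait_for_display TRIES PROBE [ARGS...]
# Runs PROBE up to TRIES times and returns 0 as soon as it succeeds.
wait_for_display() {
    tries=$1
    shift
    while [ "$tries" -gt 0 ]; do
        # An X client used as the probe succeeds only once the server
        # actually accepts connections, not merely once the PID exists.
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        tries=$((tries - 1))
        # PROBE_INTERVAL is an illustrative knob (seconds between probes).
        sleep "${PROBE_INTERVAL:-1}"
    done
    return 1
}
```

Assuming xdpyinfo (from x11-utils) is installed, a call could look like
`wait_for_display 10 xdpyinfo -display :99`, with :99 standing in for
whatever display number was chosen.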

Best regards

Ole

[1] https://bugs.debian.org/1094102

#1095028#12
Date:
2025-05-25 11:32:00 UTC
From:
To:
severity 1093686 grave
affects 1093686 src:rhythmbox src:merkaartor src:libktorrent src:maliit-keyboard src:kf6-kconfig
thanks

Hello Simon. I was going to comment on #1093686 against src:rhythmbox, but then I realized
this is very likely a manifestation of this bug in xvfb-run, which is probably
the reason why several other packages FTBFS randomly as well.

In fact, all the packages below use xvfb-run in their tests, and I get random failures
with the following failure rates:

0.260 rhythmbox  (26/100)
   reported as #1093686
0.270 merkaartor  (27/100)
   (not reported yet)
0.340 libktorrent  (34/100)
   (not reported yet)
0.400 maliit-keyboard  (40/100)
   (not reported yet)
0.411 kf6-kconfig  (37/90)
   (not reported yet)

Build logs available here:

https://people.debian.org/~sanvila/build-logs/xvfb/

According to the general guidelines given by Paul in Bug #1057562,
I think all the above issues should be considered as RC.

(Note: In the above, I'm using a threshold of 1/4, which is a little
bit more permissive than the 1/6 figure suggested by Paul).

Thanks.

#1095028#27
Date:
2025-06-02 11:25:34 UTC
From:
To:
Note that xvfb-run does have logic that is intended to wait for Xvfb to be
ready: it sets SIGUSR1 to be handled (with a trivial handler), and then
the wait(1) builtin in the quoted section has (is meant to have?) two
terminating conditions:

1. the xvfb-run script receives SIGUSR1, terminating wait(1)
    unsuccessfully, after which Xvfb should already have its listening
    socket ready to receive requests, `kill -0 $XVFBPID` should still
    succeed, and then whatever tests we are running should also succeed;

2. or the Xvfb process terminates early due to an error, after which
    wait(1) exits successfully, but then `kill -0 $XVFBPID` fails
    (and then xvfb-run also fails)
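
The two conditions above can be illustrated with a stand-in for Xvfb. This
is a sketch of the handshake, not the actual xvfb-run code: FAKE_SERVER and
start_and_wait are hypothetical names, and the stand-in signals its parent
unconditionally, whereas a real X server only does so when it inherited
SIGUSR1 as ignored.

```shell
# Stand-in for Xvfb (hypothetical): pretends to initialize for a second,
# sends SIGUSR1 to its parent, then keeps running for a while.
FAKE_SERVER='sleep 1; kill -USR1 "$PPID"; exec sleep 5'

# Usage: start_and_wait SERVER [ARGS...]
# Returns 0 if the server is still alive after wait returns (case 1),
# non-zero if it exited before signalling readiness (case 2).
start_and_wait() {
    # Trivial handler: its only purpose is to make SIGUSR1 interrupt wait.
    trap : USR1
    # Start the server with SIGUSR1 ignored, which is what tells a real
    # X server to signal its parent once its sockets are set up.
    ( trap "" USR1; exec "$@" ) &
    server_pid=$!
    # Returns early (unsuccessfully) on SIGUSR1, or normally on child exit.
    wait "$server_pid" || :
    # Distinguish the two cases by checking that the process still exists.
    kill -0 "$server_pid" 2>/dev/null
}
```

Running `start_and_wait sh -c "$FAKE_SERVER"` exercises case 1, and
`start_and_wait false` exercises case 2.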

But perhaps that logic is wrong? This is not a straightforward thing to
do correctly in shell script, and perhaps using a Perl or C helper would
be more reliable.

Xserver(1) documents two ways to wait for the X server to be ready. One
is the SIGUSR1 mechanism used by xvfb-run. The other is to run it with
option `-displayfd FD`, which makes it choose and output a display
number on the given fd, similar to `dbus-daemon --print-address=FD` (see
also test/simple-xinit.c and libxkbcommon_1.7.0-2/test/xvfb-wrapper.c).
This is not compatible with `xvfb-run --server-num` because it doesn't
allow the caller to influence the display number to use, but perhaps
`xvfb-run -a` could use -displayfd?
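
A rough sketch of what a -displayfd based startup could look like in shell,
under the assumption that the server writes the display number followed by
a newline to the given fd. The helper name start_with_displayfd and the
variables it sets are hypothetical, and the xauth setup that xvfb-run
performs is left out.

```shell
# Hypothetical helper, not part of xvfb-run.
# Usage: start_with_displayfd SERVER [ARGS...]
# SERVER must accept "-displayfd FD" and write the chosen display number,
# followed by a newline, to that fd once it is ready.
# Sets server_pid and display_num; returns non-zero if nothing was reported.
start_with_displayfd() {
    dir=$(mktemp -d) || return 1
    mkfifo "$dir/displayfd" || return 1
    # fd 3 of the server is the write end of the FIFO.
    "$@" -displayfd 3 3>"$dir/displayfd" &
    server_pid=$!
    # Blocks until the server reports a display number, or until it exits
    # and closes fd 3 without having written anything.
    read -r display_num <"$dir/displayfd" || display_num=
    rm -rf "$dir"
    [ -n "$display_num" ]
}
```

Usage could then be `start_with_displayfd Xvfb -screen 0 1280x1024x24`
followed by running the client with DISPLAY=":$display_num".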

     smcv

#1095028#32
Date:
2025-06-10 04:58:15 UTC
From:
To:
Debian supporters -

This bug sounds disturbing.  I use xvfb-run "all the time"
on Bookworm, and have never seen it act up.
I agree with Simon, the symptoms sound like a race condition,
even though the shell script and its concept seem sound,
and have worked well for many years.

Does anyone have any hints on how to reproduce this occasional
failure-to-launch on a development machine?
Using amd64 bookworm as a host, I just tried:
  fresh amd64 trixie install in a chroot
  install dependencies for rhythmbox
  as unprivileged user, apt-get source rhythmbox
  cd rhythmbox-3.4.8
  dpkg-buildpackage -rfakeroot
it worked (as in, .deb files came out, and no xvfb fault).
Not only that, but
  xvfb-run dh_auto_test -- --timeout-multiplier 3
ran 20 times in a row without a failure.
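
For larger sample sizes, a repeat loop along these lines could be used to
estimate a failure rate. It is only a sketch; count_failures is a
hypothetical helper name, and it discards the command's output, so a real
run would probably want to keep logs instead.

```shell
# Hypothetical helper for estimating an intermittent failure rate.
# Usage: count_failures RUNS CMD [ARGS...]
count_failures() {
    runs=$1
    shift
    failures=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        # Count every non-zero exit status as one failed run.
        "$@" >/dev/null 2>&1 || failures=$((failures + 1))
        i=$((i + 1))
    done
    echo "$failures/$runs runs failed"
}
```

For example:
  count_failures 100 xvfb-run dh_auto_test -- --timeout-multiplier 3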

  - Larry

#1095028#43
Date:
2025-08-31 12:31:05 UTC
From:
To:
Control: reassign 1093686 src:rhythmbox
Control: reassign 1111542 src:rhythmbox
Control: merge 1093686 1111542

Looking more closely at rhythmbox, I think the X server's "reset"
behaviour is more likely to be the actual problem with rhythmbox's tests
than a race condition in xvfb-run, so I've unmerged the two rhythmbox
bugs from #1095028. rhythmbox has
several unit tests, each of which connects to the X11 display, so if the
X server has its default behaviour of "resetting" (and in the process,
briefly not listening on its socket) after each transition from 1 client
to 0 clients, we can expect that the tests will intermittently fail to
connect to X11.

It is possible that the giza package, whose test failures led to a race
condition in xvfb-run being hypothesized (#1095028), is in fact also
suffering from the same thing - but I have no particular knowledge of
the giza package or how it works, so I can't be sure about that. As a
result I've left #1095028 assigned to xvfb.

I still think that -noreset would be a better default for xvfb-run,
because the "reset" behaviour regularly gives packages a source of
intermittent test failures and I can't think of any situations where it
would actually be desirable; but if the maintainers of xvfb-run are
concerned about backward compatibility, then every package with two or
more X11-dependent unit tests (especially if they run in parallel) will
have to continue to work around it.

What I'm intending to try in rhythmbox is:

1. Run xvfb-run with `-s "-noreset"` (and other commonly-used options).
    If this makes the tests apparently reliable, great, we can stop here.

2. Add a short sleep after starting Xvfb, something like:
    xvfb-run ... sh -c 'sleep 3; exec "$@"' sh "$@"
    If there is indeed a race condition in xvfb-run, that will work
    around it by giving Xvfb an extra 3 seconds to start up, which in
    practice should be enough to make Xvfb win the race. If that makes
    the tests apparently reliable, great, we can stop here.

3. As a last resort, if neither of those works, temporarily ignore test
    failures.
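
Steps 1 and 2 combined could be wrapped roughly as follows. This is a
sketch under assumptions: run_under_xvfb, XVFB_RUN and STARTUP_DELAY are
hypothetical names introduced here for illustration; only the -a and -s
options of xvfb-run and the -noreset server option are taken as given.

```shell
# Hypothetical wrapper combining steps 1 and 2.
# Usage: run_under_xvfb CMD [ARGS...]
# XVFB_RUN (command to use instead of xvfb-run) and STARTUP_DELAY
# (seconds to sleep before CMD) are illustrative knobs, not xvfb-run
# features.
run_under_xvfb() {
    # -a picks a free display number; -s passes -noreset to Xvfb so the
    # server does not reset when its last client disconnects.
    ${XVFB_RUN:-xvfb-run} -a -s "-noreset" \
        sh -c 'sleep "$1"; shift; exec "$@"' sh "${STARTUP_DELAY:-3}" "$@"
}
```

Setting STARTUP_DELAY=0 corresponds to trying step 1 on its own.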

     smcv