#1111954 guile-fibers: flaky autopkgtest: assert terminates: (run-fibers (lambda () (spawn-fiber-chain 5000)) #:drain? #t): Aborted

Package:
src:guile-fibers
Source:
src:guile-fibers
Submitter:
Paul Gevers
Date:
2025-09-18 13:51:01 UTC
Severity:
normal
Tags:
#1111954#5
Date:
2025-08-24 09:30:40 UTC
From:
To:
Dear maintainer(s),

I looked at the results of the autopkgtest of your package. I noticed
that it regularly fails. I scheduled 10 tests yesterday in testing on
amd64 and it failed 8 times. I've put some of the output of one of the
failures below, but it seems like the failures are not always the same
(across architectures).
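
One way to put a number on flakiness like the 8-out-of-10 figure above
is to rerun a test command several times and count the failures.  The
following is a minimal shell sketch, not how debci schedules or counts
runs; the helper name count_failures is hypothetical, and results on a
local machine may well differ from those on the CI workers.

```shell
# Hypothetical helper: run a command N times and print how many runs failed.
# Output of the command itself is discarded; only the exit status is counted.
count_failures() {
    runs=$1
    shift
    failures=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        "$@" >/dev/null 2>&1 || failures=$((failures + 1))
        i=$((i + 1))
    done
    echo "$failures/$runs runs failed"
}

# Possible use from an unpacked source tree (assumes guile and
# guile-fibers are installed):
#   count_failures 10 guile tests/basic.scm
```

The printed ratio can then be compared against whatever failure rate is
considered acceptable.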

Because the unstable-to-testing migration software now blocks on
regressions in testing, flaky tests, i.e. tests that flip between
passing and failing without changes to the list of installed packages,
are causing people unrelated to your package to spend time on these
tests.

Don't hesitate to reach out if you need help and some more information
from our infrastructure.

Paul

https://ci.debian.net/packages/g/guile-fibers/testing/amd64/63640034/

129s assert terminates: (run-fibers (lambda () (spawn-fiber-chain 50))
#:drain? #t): ok (0.206824881 s)
129s assert terminates: (run-fibers (lambda () (spawn-fiber-chain 500))
#:drain? #t): ok (0.166856084 s)
130s assert terminates: (run-fibers (lambda () (spawn-fiber-chain 5000))
#:drain? #t): Aborted
130s autopkgtest [14:37:45]: test guile-tests-basic:
-----------------------]

#1111954#10
Date:
2025-08-24 12:10:38 UTC
From:
To:
Thanks for the report!  Here is my analysis, mostly intended as
reference for a future upstream bug report.

List of all debci jobs:
https://ci.debian.net/packages/g/guile-fibers/testing/amd64/

Click on the 'test log' link in the 'Results' column for those lines
where 'Status' indicates 'fail'.

I have triggered a bunch of jobs for some other archs too, but this
appears to be amd64-specific:

https://ci.debian.net/packages/g/guile-fibers/testing/arm64/
https://ci.debian.net/packages/g/guile-fibers/testing/ppc64el/
https://ci.debian.net/packages/g/guile-fibers/testing/riscv64/
https://ci.debian.net/packages/g/guile-fibers/testing/s390x/

I have seen these failures myself earlier, but they usually just
disappear when I try to debug them, and the buildds and Salsa CI are
more often happy than debci is, for some reason.

Reviewing the failures, it seems all of them are 'guile-tests-basic'
which is:

Test-Command: guile tests/basic.scm
Features:
 test-name=guile-tests-basic,
Restrictions:
 allow-stderr,
Depends:
 guile-3.0,
 guile-fibers,

The failures all occur within 5 minutes, so there are no infloops or
long delays here.  They all also fail on exactly the same line:

assert terminates: (run-fibers (lambda () (spawn-fiber-chain 5000)) #:drain? #t): Aborted

Paul, would the patch below improve the situation for you in Debian, or
does it not matter until we stop this test from being flaky?  I suppose
we could remove the test from debian/tests/, but I believe it actually
indicates a serious upstream problem that we want to get resolved.

Btw, what is the workflow that ends up noticing flaky tests in
guile-fibers?  I would expect guile-fibers to not have any reverse build
dependencies in Debian except for packages I work on.

/Simon

diff --git a/debian/tests/control b/debian/tests/control
index 248e3da..d5d09cb 100644
--- a/debian/tests/control
+++ b/debian/tests/control
@@ -3,6 +3,7 @@ Features:
  test-name=guile-tests-basic,
 Restrictions:
  allow-stderr,
+ flaky,
 Depends:
  guile-3.0,
  guile-fibers,

#1111954#19
Date:
2025-08-24 13:24:25 UTC
From:
To:
Hi Simon,


Interesting. Our amd64 worker is the most powerful host that we have.
Might it be a race condition, or something related to parallelism?


Sure it does help on the infrastructure, but it does paper over the real
problem.


Are you talking about only this test, or the whole test stanza in
d/t/control? In my opinion removing an individual test from a suite is
better than marking a full stanza as flaky.


I'm currently checking all packages where the last "pure" testing run in
the last 2 months is failing. This includes flaky tests.

Paul

#1111954#24
Date:
2025-08-25 07:51:24 UTC
From:
To:
Hi Paul,

As more reference for upstream analysis, we got another debci failure on
arm64, note that it took almost 3 hours to complete:

https://ci.debian.net/packages/g/guile-fibers/testing/arm64/63660342/#S14

126s autopkgtest [11:57:22]: test guile-tests-foreign: [-----------------------
127s assert #f equal to #f: ok
127s assert #t terminates: ok
128s assert (sleep 1) terminates: ok
129s assert (perform-operation (sleep-operation 1)) terminates: ok
129s assert (receive-from-fiber 42) equal to 42: ok
10126s assert (send-to-fiber 42) equal to 42: autopkgtest [14:44:02]: ERROR: timed out on ...
10127s autopkgtest [14:44:03]: test guile-tests-foreign: -----------------------]

Another for "tests-channels" on riscv64, also took several hours:

https://ci.debian.net/packages/g/guile-fibers/testing/riscv64/63660348/

409s autopkgtest [12:02:15]: test guile-tests-channels: [-----------------------
20409s assert run-fibers on (rpc 1) terminates: autopkgtest [17:35:35]: ERROR: timed out on ...
20409s autopkgtest [17:35:35]: test guile-tests-channels: -----------------------]

Another one for "tests-foreign", also took hours:

https://ci.debian.net/packages/g/guile-fibers/testing/riscv64/63660356/#S14

581s autopkgtest [12:05:48]: test guile-tests-foreign: [-----------------------
584s assert #f equal to #f: ok
584s assert #t terminates: ok
585s assert (sleep 1) terminates: ok
586s assert (perform-operation (sleep-operation 1)) terminates: ok
20581s assert (receive-from-fiber 42) equal to 42: autopkgtest [17:39:08]: ERROR: timed out on ...
20581s autopkgtest [17:39:08]: test guile-tests-foreign: -----------------------]

These tests should be fairly quick, certainly not taking hours.  Is
there a way to lower the debci timeout to say 30 minutes to avoid them
consuming CPU time?

Paul Gevers <elbrus@debian.org> writes:

Yes this is a parallelism-heavy package, and the self-tests stress this.

On old systems I would expect this to trigger libc and kernel bugs, but
I think on any modern system the problem is more likely to be within
guile-fibers.

I was thinking how to lower the severity of this bug report.

Is 'Serious' the right criticality for a flaky debci failure?

If we mark the (apparently) flaky tests with 'flaky', would it still be
'Serious'?

My plan is to make another upload, and for all the tests we've seen to
be flaky, mark them as 'flaky' so they hopefully won't disturb any debci
workflow as much.  Maybe this allows us to lower the severity to Normal
and consider this an upstream bug?  I suspect it will take time to
resolve; I started a similar dance with Shepherd upstream bugs a couple
of months ago about flaky tests and we are still not finished.

Each test has its own stanza.  I don't think it is possible to separate
each test further in any simple way.  If one of the stanzas fails
spuriously, I think the right thing is to mark that one flaky until
upstream resolves it (or we realize it is a Debian-specific problem).

I see, thank you for explaining and doing that!

/Simon

#1111954#29
Date:
2025-08-25 17:49:10 UTC
From:
To:
Hi,


My (as a Release Team member) threshold is somewhere around 1/5 to 1/7.


If the test isn't marked as FAILED, it's no longer serious. So yes,
marking a test flaky will lower the severity of this bug report.

Paul

#1111954#34
Date:
2025-08-25 18:04:12 UTC
From:
To:
Paul Gevers <elbrus@debian.org> writes:

Ok.  Let's see if 1.3.1-8 improves things, it marks tests as flaky that
I've witnessed flip-flop on some archs.  I may have missed some, we'll
see: I queued a couple of debci runs of it on all archs.  I am hoping
that flaky tests won't result in FAILED even on failure, but end up in
some other state that is ignored by people who keep debci running.

I suggest using this bug report to discuss (and tag) flaky tests in
Debian, but reports to FIX the tests to not fail at all in the first
place should go upstream.  I am hoping Ludo' has some cycles to look
into these reports, or at least push out a new upstream release shortly
to see if anything applied since the last release already addresses this.

/Simon

#1111954#39
Date:
2025-09-10 18:11:07 UTC
From:
To:
severity 1111954 normal
forwarded 1111954 https://codeberg.org/fibers/fibers/issues/127
thanks

As far as I can tell, all flaky debci checks have now been marked as
flaky and no longer trigger a debci FAIL after 1.4.0-1, so I'm
hoping we can lower the severity of this now.

Fixing the flaky tests has been reported upstream:

https://codeberg.org/fibers/fibers/issues/127

I suspect this may take time to fix, as it could be several different
underlying bugs, and could be kernel/arch/libc-related.

Builds are still flaky due to this problem, but the buildds appear to
reschedule builds, which eventually succeed.  We could silence 'make
check' failures during builds too if necessary.  I'm not sure what the
best way is to deal with flaky build failures.  Having them FAIL but
eventually succeed allows us to more easily catch which self checks and
which platforms fail, which may reveal some pattern and help with
upstream bug reporting.  But if this is an annoyance we can do
something like:

override_dh_auto_test:
        -dh_auto_test

or something fancier if there is some subset of tests that actually
always manages to succeed (as suggested by those debian/tests/control
tests without the flaky mark).
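
As a rough illustration of the "something fancier" idea, a test driver
could let checks believed to be reliable fail the build while only
logging failures of known-flaky ones.  This is a sketch under
assumptions: the helper name tolerant is hypothetical, and which tests
belong in which group is not established in this report.

```shell
# Hypothetical helper: run a command, and if it fails only log the exit
# status on stderr instead of propagating it to the caller.
tolerant() {
    "$@" || echo "ignoring failure (exit $?): $*" >&2
}

# Possible use in a script run under "set -e" (the file name
# tests/reliable.scm is made up for illustration):
#   guile tests/reliable.scm         # strict: a failure aborts the script
#   tolerant guile tests/basic.scm   # known flaky: a failure is only logged
```

Unlike a blanket "-dh_auto_test", this keeps the strict checks able to
fail the build.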

/Simon

#1111954#46
Date:
2025-09-18 06:38:28 UTC
From:
To:
Control: severity -1 grave
Control: tags -1 ftbfs
Control: found -1 1.3.1-5

These are manual give-backs.

Please fix your package to build reliably.

This won't work, since the build is killed.

And what you have done in debci is also a huge waste of resources.

"flaky" means a debci builder is blocked for 3 or 6 hours:
https://ci.debian.net/packages/g/guile-fibers/testing/ppc64el/
https://ci.debian.net/packages/g/guile-fibers/testing/s390x/
https://ci.debian.net/packages/g/guile-fibers/testing/riscv64/

Your test hangs are also blocking a buildd for 2.5 or 10 hours,
or a builder in the reproducible infrastructure for 18 hours:
https://tests.reproducible-builds.org/debian/history/guile-fibers.html
https://buildd.debian.org/status/logs.php?pkg=guile-fibers&arch=riscv64
https://buildd.debian.org/status/logs.php?pkg=guile-fibers&arch=arm64
https://buildd.debian.org/status/logs.php?pkg=guile-fibers&arch=s390x

When a release architecture has 2 buildds and your test hang blocks one
of them for 2.5 hours, then this architecture has lost half its buildds
for 2.5 hours.

Implementing a per-test timeout with a small timeout per test inside
your package and then ignoring test failures might be OK, but please
stop blocking various kinds of builders for hours with hangs.
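
A minimal sketch of that per-test timeout idea, assuming the GNU
coreutils timeout command is available (it exits with status 124 when
the limit is hit).  The helper name run_with_limit and the limit value
in the usage comment are illustrative assumptions, not taken from the
package.

```shell
# Hypothetical helper: run one command under a time limit, report the
# outcome, and always return 0 so that a hang or failure cannot fail or
# stall the caller beyond the limit.
run_with_limit() {
    limit=$1
    shift
    if timeout "$limit" "$@"; then
        echo "PASS: $*"
    else
        status=$?
        if [ "$status" -eq 124 ]; then
            # GNU timeout uses 124 to signal that the limit was reached.
            echo "TIMEOUT: $*"
        else
            echo "FAIL (exit $status): $*"
        fi
    fi
    return 0
}

# Possible use, one call per test file (the limit is an arbitrary example):
#   run_with_limit 5m guile tests/basic.scm
```

Because the outcome is still printed, the logs keep showing which tests
hang on which platforms even though the overall run no longer fails.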

cu
Adrian

#1111954#57
Date:
2025-09-18 12:50:31 UTC
From:
To:
Sorry about this - I was hoping upstream's patches in latest release
would have fixed this (the forwarded bug report is closed as such), but
it seems these self-checks need some more work before they are ready.

I made it so that self-checks don't trigger FTBFS and all checks are
wrapped in a "timeout" of max 15m runtime, see:

https://salsa.debian.org/debian/guile-fibers/-/commit/225287f55d690f9a1fe04618a813d05e69ca969c

I'm hoping this will bring the build/test time back within reason.
Maybe it can be tightened further: a build finishes within a minute or
so on my laptop, but I suspect a busy armel server could be 10-15x as
slow.

/Simon

Adrian Bunk <bunk@debian.org> writes:

#1111954#62
Date:
2025-09-18 13:04:22 UTC
From:
To:
We believe that the bug you reported is fixed in the latest version of
guile-fibers, which is due to be installed in the Debian FTP archive.

A summary of the changes between this version and the previous one is
attached.

Thank you for reporting the bug, which will now be closed.  If you
have further comments please address them to 1111954@bugs.debian.org,
and the maintainer will reopen the bug report if appropriate.

Debian distribution maintenance software
pp.
Simon Josefsson <simon@josefsson.org> (supplier of updated guile-fibers package)

(This message was generated automatically at their request; if you
believe that there is a problem with it please contact the archive
administrators by mailing ftpmaster@ftp-master.debian.org)
Format: 1.8
Date: Thu, 18 Sep 2025 14:28:33 +0200
Source: guile-fibers
Architecture: source
Version: 1.4.1-2
Distribution: unstable
Urgency: medium
Maintainer: Simon Josefsson <simon@josefsson.org>
Changed-By: Simon Josefsson <simon@josefsson.org>
Closes: 1111954
Changes:
 guile-fibers (1.4.1-2) unstable; urgency=medium
 .
   * Add 15m test timeout and don't FTBFS on tests.  Closes: #1111954.
Checksums-Sha1:
 72c828a42d8ed48224e3e605c7e3b58691d44bbc 2261 guile-fibers_1.4.1-2.dsc
 a85375e92a04c752228973e584db1422e81e0a9d 5044 guile-fibers_1.4.1-2.debian.tar.xz
 e23ac04aac42b55b452b81b4bdd3249e277ea906 225548 guile-fibers_1.4.1-2.git.tar.xz
 94057ccab4bcb2fc5088847795c25b5162413b7a 18238 guile-fibers_1.4.1-2_source.buildinfo
Checksums-Sha256:
 1df31edcfbda76d2965d5d5815a86fb5f7c10df90091d66e3dc3630f1862ede4 2261 guile-fibers_1.4.1-2.dsc
 4173bb6743010ed4c81ecb43c79bfd2b23d4049d0d6cd91e5479c16198599abf 5044 guile-fibers_1.4.1-2.debian.tar.xz
 19b9e19613403feebfd635d7ff0a271e31946cfb58bbefb370b023ad68f4474f 225548 guile-fibers_1.4.1-2.git.tar.xz
 761420111c3aa6ebe50ac70071608a99ea9ff48068efe94e941701d8d533149d 18238 guile-fibers_1.4.1-2_source.buildinfo
Files:
 d76719b0b4a3e32b6e1e13b66349f787 2261 libs optional guile-fibers_1.4.1-2.dsc
 6fb7947a3f2f499e28b89e318e710217 5044 libs optional guile-fibers_1.4.1-2.debian.tar.xz
 a6549eeef865a3448eca4618ab1e2240 225548 libs optional guile-fibers_1.4.1-2.git.tar.xz
 c504635191b53f727ca1664c663cfa0b 18238 libs optional guile-fibers_1.4.1-2_source.buildinfo
Git-Tag-Info: tag=d3ad1e0ac172a59a604435dfb643d545e72b148b fp=a3cc9c870b9d310abad4cf2f51722b08fe4745a2
Git-Tag-Tagger: Simon Josefsson <simon@josefsson.org>
-----BEGIN PGP SIGNATURE-----

iQIzBAEBCgAdFiEEN02M5NuW6cvUwJcqYG0ITkaDwHkFAmjL/sMACgkQYG0ITkaD
wHmJgA/+NjH6JmeWFS3VnmfUhYj1dlxvF+50LDnqi9nkLPvAaCUsN1JEiOiIO0kM
lySgkWeg3/dhQESZzlUFIgzW6LY7kbOhxOIsjNBIqzxf2hgd7bIVccxp6aIHgdBW
RUNDr+B6XEiGMA6+MbOCDiJr3ndrLoARZjUGixSxFyZbDjB4mC8hvgxaGF8adeTU
UK3AVysPnQLfLRi8FoS0YPO4BVPLe9/3QvD1sZBWZ3Shvb9LXSuw/a7G9nFIH//7
D32ip1ebfgcWsftus0+fRMglsOBPVbr3/15BMeC6b1v9VDqLceNSd45zojt2oqkr
VZ5wO6PzGDErwxoItTFr/afmbXf5hdlOz3eKbPC7ZGJSttWGyLtw7gPU2CHBWT8M
w1yic5uUKZ1m4S5+b7y3F5RkzA6ZgTZx4aqPJ9DW9VsP4SyKfD8dd77rHc8SGsWa
8bCvK9FOIrTk3vbVC+f1gNrhgesw4iT0ZCOEL9FRcqk7Aur8l1qrr8iRt6dUUCYZ
OD+PI6vUjZZ6lW4WuoemtUd3TLt8rJKbOLtFYnIcc/7yim9jOqPhGoNt4qAZyeWn
fytfno663DutFFD7NB/9edW+ljWhcJf+btiDyNrBIk3QQvR4Q3SVQoGa/sVa6Gs5
Y8152/Dtn4E2PAcgBgWxLBI/ja2EmBXtEJf3mIfTlFgUqZaWhKs=
=JeZb
-----END PGP SIGNATURE-----

#1111954#67
Date:
2025-09-18 13:49:09 UTC
From:
To:
No worries, I understand you were not aware of the burden your package
created for our infrastructure.

buildds never build more than 1 package at a time,
which makes build times usually quite predictable.

armel (which has been built on arm64 hardware for some time now) is not
slower than ppc64el here:
https://buildd.debian.org/status/logs.php?pkg=guile-fibers&arch=armel
https://buildd.debian.org/status/logs.php?pkg=guile-fibers&arch=ppc64el
(Concurrency in Lisp might show performance patterns quite different
from average packages.)

15 minutes is too tight on riscv64:
https://buildd.debian.org/status/logs.php?pkg=guile-fibers&arch=riscv64
(Which is not a real problem when you are anyway ignoring failures.)

All this is just a workaround for running tests with flaky hangs that
should be fixed.  Really generous timeouts (or relying on buildd/debci
timeouts) are OK when test hangs are a "never happens" event indicating
some unexpected serious breakage.[1]

Thanks
Adrian

[1] a common problem with timeouts is that they are often too low and then
    slow buildds hit them, which is why tight timeouts are usually not
    a good idea