#918464 nocache.c:148: init_mutexes: Assertion `fds_lock != NULL' failed.

Package:
src:nocache
Source:
nocache
Submitter:
Santiago Vila
Date:
2024-06-04 11:48:03 UTC
Severity:
normal
Tags:
#918464#5
Date:
2019-01-04 22:49:09 UTC
From:
To:
Dear maintainer:

I tried to build this package in sid but it failed:
--------------------------------------------------------------------------------
[...]
 debian/rules build-arch
dh build-arch
   dh_update_autotools_config -a
   dh_autoreconf -a
   dh_auto_configure -a
   dh_auto_build -a
	make -j1 "INSTALL=install --strip-program=true"
make[1]: Entering directory '/<<PKGBUILDDIR>>'
cc -g -O2 -fdebug-prefix-map=/<<PKGBUILDDIR>>=. -fstack-protector-strong -Wformat -Werror=format-security -Wall -Wdate-time -D_FORTIFY_SOURCE=2 -Wl,-z,relro -Wl,-z,now -o cachedel cachedel.c
cc -g -O2 -fdebug-prefix-map=/<<PKGBUILDDIR>>=. -fstack-protector-strong -Wformat -Werror=format-security -Wall -Wdate-time -D_FORTIFY_SOURCE=2 -Wl,-z,relro -Wl,-z,now -o cachestats cachestats.c
cc -g -O2 -fdebug-prefix-map=/<<PKGBUILDDIR>>=. -fstack-protector-strong -Wformat -Werror=format-security -Wall -Wdate-time -D_FORTIFY_SOURCE=2 -Wl,-z,relro -Wl,-z,now -fPIC -c -o nocache.o nocache.c
cc -g -O2 -fdebug-prefix-map=/<<PKGBUILDDIR>>=. -fstack-protector-strong -Wformat -Werror=format-security -Wall -Wdate-time -D_FORTIFY_SOURCE=2 -Wl,-z,relro -Wl,-z,now -fPIC -c -o fcntl_helpers.o fcntl_helpers.c
cc -g -O2 -fdebug-prefix-map=/<<PKGBUILDDIR>>=. -fstack-protector-strong -Wformat -Werror=format-security -Wall -Wdate-time -D_FORTIFY_SOURCE=2 -Wl,-z,relro -Wl,-z,now -fPIC -c -o pageinfo.o pageinfo.c
cc -g -O2 -fdebug-prefix-map=/<<PKGBUILDDIR>>=. -fstack-protector-strong -Wformat -Werror=format-security -Wall -Wdate-time -D_FORTIFY_SOURCE=2 -Wl,-z,relro -Wl,-z,now -pthread -shared -Wl,-soname,nocache.so -o nocache.so nocache.o fcntl_helpers.o pageinfo.o -ldl
sed 's!##libdir##!$(dirname "$0")!' <nocache.in >nocache
chmod a+x nocache
make[1]: Leaving directory '/<<PKGBUILDDIR>>'
   debian/rules override_dh_auto_test
make[1]: Entering directory '/<<PKGBUILDDIR>>'
## #916415
timeout 11 ./nocache apt show coreutils 1>>/dev/null

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

make[1]: *** [debian/rules:21: override_dh_auto_test] Error 124
make[1]: Leaving directory '/<<PKGBUILDDIR>>'
make: *** [debian/rules:10: build-arch] Error 2
dpkg-buildpackage: error: debian/rules build-arch subprocess returned exit status 2
--------------------------------------------------------------------------------

To be sure, I have tried to build the package 151 times on 8 different machines
and it failed 151 times. Here are the full build logs:

https://people.debian.org/~sanvila/build-logs/nocache/

A very similar failure happened here in mipsel, a release architecture:

https://buildd.debian.org/status/fetch.php?pkg=nocache&arch=mipsel&ver=1.1-1&stamp=1546582253&raw=0

If you need help to reproduce this, please say so, I would gladly offer access to a system
where this seems to happen all the time.

Thanks.

#918464#10
Date:
2019-01-04 23:43:14 UTC
From:
To:
tags 918316 + patch
thanks

The patch below works for me:
--- a/debian/rules
+++ b/debian/rules
@@ -18,5 +18,5 @@ override_dh_auto_test:
 ifeq (,$(filter nocheck,$(DEB_BUILD_OPTIONS)))
 # -NOCACHE_NR_FADVISE=2 dh_auto_test -v
 	## #916415
-	timeout 11 ./nocache apt show coreutils 1>>/dev/null
+	timeout 60 ./nocache apt show coreutils 1>>/dev/null
 endif

Note: I don't quite understand the purpose of the timeout. Is it really
useful/required to set a timeout at all? Normally sbuild (the autobuilder
program used by the build daemons) already has a built-in timeout mechanism
which prevents the autobuilder from being stuck forever, and by looking at
build logs from reproducible builds, I believe pbuilder also has a timeout
by default.

Thanks.

#918464#17
Date:
2019-01-05 07:54:39 UTC
From:
To:
I get a different error here:

,----
| ## #916415
| timeout 11 ./nocache apt show coreutils 1>>/dev/null
| apt: nocache.c:148: init_mutexes: Assertion `fds_lock != NULL' failed.
| Aborted
| make[1]: *** [debian/rules:21: override_dh_auto_test] Error 134
`----

Increasing the timeout to 60 as you suggested does not help.
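
The numbers after "Error" in the make output can be decoded with a small helper. This is only a sketch, assuming the usual conventions: GNU timeout exits with 124 when the time limit expires, and a status above 128 means the command died from a signal (status minus 128). The name describe_exit_status is hypothetical and not part of nocache.

```c
#include <stdio.h>

/* Hypothetical helper: classify an exit status as printed by make.
 * Assumes 124 means "GNU timeout expired" and that values above 128
 * encode "killed by signal (status - 128)". */
const char *describe_exit_status(int status, char *buf, size_t buflen)
{
    if (status == 0)
        snprintf(buf, buflen, "success");
    else if (status == 124)
        snprintf(buf, buflen, "timed out (GNU timeout)");
    else if (status > 128)
        snprintf(buf, buflen, "killed by signal %d", status - 128);
    else
        snprintf(buf, buflen, "exited with status %d", status);
    return buf;
}
```

Under these conventions, the 124 in the original report reads as an expired timeout, while the 134 here reads as signal 6, which is SIGABRT on Linux and is consistent with a failed assertion followed by "Aborted".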

Cheers,
       Sven

#918464#22
Date:
2019-01-06 07:07:11 UTC
From:
To:
Hello,

Bug #918316 in nocache reported by you has been fixed in the
Git repository and is awaiting an upload. You can see the commit
message below and you can check the diff of the fix at:

https://salsa.debian.org/debian/nocache/commit/4cc35e3d2042b7f80bec7f31f7ed4d1fef329c75
------------------------------------------------------------------------
rules: increase test timeout (Closes: #918316).

Thanks, Santiago Vila.
------------------------------------------------------------------------

(this message was generated automatically)
-- 
Greetings

https://bugs.debian.org/918316

#918464#29
Date:
2019-01-06 07:11:04 UTC
From:
To:
Hi Santiago,

Thanks for the patch.

I see, this issue is environment-specific and seems to occur on slow(er)
machines like MIPS.

In this case the timeout is _necessary_. As you may have noticed from the
comment, this is a regression test for #916415. The timeout is required
because the process never exits (hangs) when the test fails.

The timeout here aborts this particular test if/when it fails. It is better
to fail quickly (within a minute) than to needlessly occupy a builder for an
hour.
#918464#34
Date:
2019-01-06 07:20:05 UTC
From:
To:
The exit code suggests that APT is not happy, so the timeout has nothing to
do with it, and I suspect this is unrelated to "nocache".

Can you reproduce manually by "apt show coreutils"?

Also, on which architecture is this?

Thanks.
#918464#39
Date:
2019-01-06 11:02:29 UTC
From:
To:
Control: clone -1 -2
Control: retitle -2 nocache.c:148: init_mutexes: Assertion `fds_lock != NULL' failed.
Control: severity -2 normal

That is unrelated to Santiago's problem, and I should have reported it
separately.  Creating a new clone now, will followup when I have the
cloned bug's number.

Cheers,
       Sven

#918464#50
Date:
2019-01-06 12:04:19 UTC
From:
To:
[Following up on the cloned bug 918464 and dropping Santiago from CC.]

ITYM it has nothing to do with timeout.  As the failed assertion comes
from nocache.c, it certainly has to do with nocache. ;-)

No, but with any program under nocache, e.g. "nocache true".

Plain amd64.


The good news is that I seem to have found the explanation for the
failed assertion.  In line 147 of nocache.c we have

    fds_lock = malloc(max_fds * sizeof(*fds_lock));

and malloc obviously returned NULL.  With a debug printf statement I
found out that max_fds == 1073741816, with sizeof(*fds_lock) == 40 it is
not too surprising that malloc failed.
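
To make the size of that request concrete, here is a minimal sketch of the arithmetic. The helper name table_bytes is hypothetical; the two inputs are the values reported above (max_fds == 1073741816 and sizeof(*fds_lock) == 40).

```c
#include <stdint.h>

/* Hypothetical helper: bytes needed for a table of `max_fds` entries
 * of `entry_size` bytes each. Returns 0 if the product would overflow
 * a 64-bit unsigned integer. */
uint64_t table_bytes(uint64_t max_fds, uint64_t entry_size)
{
    if (entry_size != 0 && max_fds > UINT64_MAX / entry_size)
        return 0;
    return max_fds * entry_size;
}
```

For the values above, table_bytes(1073741816, 40) is 42949672640 bytes, i.e. just under 40 GiB requested in a single malloc call.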

Why is max_fds so high?  In the systemd changelog I found out the
following:

,----
| systemd (240-2) unstable; urgency=medium
|
|   * Don't bump fs.nr_open in PID 1.
|     In v240, systemd bumped fs.nr_open in PID 1 to the highest possible
|     value. Processes that are spawned directly by systemd, will have
|     RLIMIT_NOFILE be set to 512K (hard).
|     pam_limits in Debian defaults to "set_all", i.e. for limits which are
|     not explicitly configured in /etc/security/limits.conf, the value from
|     PID 1 is taken, which means for login sessions, RLIMIT_NOFILE is set to
|     the highest possible value instead of 512K. Not every software is able
|     to deal with such an RLIMIT_NOFILE properly.
|     While this is arguably a questionable default in Debian's pam_limit,
|     work around this problem by not bumping fs.nr_open in PID 1.
|     (Closes: #917167)
|
|  -- Michael Biebl <biebl@debian.org>  Thu, 27 Dec 2018 14:03:57 +0100
`----

And this sid system has an uptime of 13 days, so was booted with systemd
240-1 which explains the high RLIMIT_NOFILE.  On a freshly booted
laptop, I get max_fds == 1048576 instead, and obviously malloc'ing 40
Megabytes rather than 40 Gigabytes of RAM is easily possible.

I guess I should reboot in the near future.  Feel free to close the bug
if you think that dealing with a too high value of RLIMIT_NOFILE is not
possible for nocache.
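
One way a program could cope with a very high RLIMIT_NOFILE is to clamp the value before sizing its table. The following is only a sketch of that idea, not the code in nocache.c and not necessarily what upstream does: the cap MAX_TRACKED_FDS, its value, and the function names are all assumptions made for illustration, as is the choice to read the hard limit (rlim_max).

```c
#include <stdlib.h>
#include <sys/resource.h>

/* Hypothetical cap on the number of tracked descriptors; the value
 * 1 << 20 is an illustrative choice, not taken from nocache. */
#define MAX_TRACKED_FDS (1UL << 20)

/* Clamp a raw limit to the cap; RLIM_INFINITY counts as "too large". */
unsigned long clamp_max_fds(rlim_t limit)
{
    if (limit == RLIM_INFINITY || limit > MAX_TRACKED_FDS)
        return MAX_TRACKED_FDS;
    return (unsigned long)limit;
}

/* Allocate a zeroed table sized from the clamped hard RLIMIT_NOFILE.
 * Returns NULL on failure so the caller can degrade gracefully
 * instead of hitting an assertion. */
void *alloc_fd_table(size_t entry_size, unsigned long *out_entries)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return NULL;

    unsigned long n = clamp_max_fds(rl.rlim_max);
    void *table = calloc(n, entry_size);
    if (table != NULL && out_entries != NULL)
        *out_entries = n;
    return table;
}
```

The trade-off of such a cap is that descriptors numbered at or above it would have no slot in the table, so the surrounding code would need to skip them rather than index out of bounds.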

Cheers,
       Sven

#918464#61
Date:
2024-06-04 11:26:14 UTC
From:
To:
Hi,

Following a full-upgrade on two Debian Sid hosts of mine on 2024-06-02
around 21:55 UTC, I have just stumbled upon this issue.

It matches the explanation provided by Sven and can be worked around by
lowering the hard NOFILE rlimit, e.g. ulimit -Hn 10000
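
The same workaround can be applied programmatically, for example in a small wrapper that lowers the limit before exec'ing the real command. This is a minimal sketch using getrlimit and setrlimit; the function name lower_nofile_hard_limit is hypothetical, and the value 10000 simply mirrors the ulimit example.

```c
#include <sys/resource.h>

/* Hypothetical helper: lower the hard (and, if needed, the soft)
 * RLIMIT_NOFILE to at most `cap`. Returns 0 on success, -1 on error.
 * An unprivileged process can lower its hard limit but cannot raise
 * it again afterwards. */
int lower_nofile_hard_limit(rlim_t cap)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;

    if (rl.rlim_max != RLIM_INFINITY && rl.rlim_max <= cap)
        return 0; /* already low enough, nothing to do */

    rl.rlim_max = cap;
    if (rl.rlim_cur == RLIM_INFINITY || rl.rlim_cur > cap)
        rl.rlim_cur = cap; /* the soft limit may not exceed the hard limit */

    return setrlimit(RLIMIT_NOFILE, &rl);
}
```

Calling lower_nofile_hard_limit(10000) before starting the affected program should have the same effect on that process tree as the shell command, assuming the program sizes its table from the hard limit as the workaround suggests.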

However, the fact that nocache, a program typically used to leave global
memory usage untouched, triggers an OOM is particularly ironic. It would
be nice if this could be fixed, either in nocache itself or by adjusting
default rlimits.

#918464#66
Date:
2024-06-04 11:44:14 UTC
From:
To:
Addendum: this issue was seemingly fixed upstream:
https://github.com/Feh/nocache/commit/7451e161997d4282dd6b66fd1514b5b157b41f8a

Therefore, this bug could be fixed by packaging nocache v1.2, tagged two
years ago.