#930719 autopkgtest: Hangs with qemu backend when copyup destination failed

#930719#5
Date:
2019-06-19 08:33:05 UTC
From:
To:
In Kali we do have a debci setup at autopkgtest.kali.org, we have
configured it to use the qemu backend. Unfortunately, after a while
all our workers were stuck... upon investigation it looks like that
autopkgtest can get stuck with the qemu backend when the VM image
is not big enough for the "copyup" operation to complete.

We do have the problem with thunderbird with a 20Gb disk image
as created by autopkgtest-build-qemu and we already suggested
to increase the default value:
https://salsa.debian.org/ci-team/autopkgtest/merge_requests/49

But the real problem is that autopkgtest gets stuck. It would
be nice if it could at least fail reliably...

Here are some elements from my investigation:

$ systemctl status debci-worker-amd64-qemu@1.service
● debci-worker-amd64-qemu@1.service - debci-worker instance 1 for arch=amd64 & backend=qemu
   Loaded: loaded (/etc/systemd/system/debci-worker-amd64-qemu@.service; enabled; vendor preset: enabled)
   Active: active (running) since Fri 2019-06-07 12:25:01 UTC; 1 weeks 4 days ago
     Docs: https://ci.debian.net/doc/
 Main PID: 10128 (amqp-consume)
    Tasks: 14
   CGroup: /system.slice/system-debci\x2dworker\x2damd64\x2dqemu.slice/debci-worker-amd64-qemu@1.service
           ├─ 4866 /bin/sh /usr/share/debci/bin/debci-worker --do-request
           ├─ 4916 /bin/sh /usr/share/debci/bin/debci-worker --do-request
           ├─ 4918 /bin/sh /usr/share/debci/bin/debci-test --quiet --data-dir /var/lib/debci/tmp/tmp.xAIA5SX5eT --suite kali-rolling --print-output --run-id=332825 thunderbird
           ├─ 4979 /bin/sh /usr/share/debci/bin/debci-test --quiet --data-dir /var/lib/debci/tmp/tmp.xAIA5SX5eT --suite kali-rolling --print-output --run-id=332825 thunderbird
           ├─ 4981 /bin/sh /usr/share/debci/backends/qemu/test-package --output-dir /var/lib/debci/tmp/tmp.xAIA5SX5eT/autopkgtest-incoming/kali-rolling/amd64/t/thunderbird/332825 thunderbird
           ├─ 4985 /bin/sh /usr/share/debci/bin/debci-autopkgtest --user debci --output-dir /var/lib/debci/tmp/tmp.xAIA5SX5eT/autopkgtest-incoming/kali-rolling/amd64/t/thunderbird/332825 thunderbird -- qemu /var/lib/debci/qemu/kali-rolling-amd64.img
           ├─ 4986 /usr/bin/python3 -u /usr/bin/autopkgtest --no-built-binaries --user debci --output-dir /var/lib/debci/tmp/tmp.xAIA5SX5eT/autopkgtest-incoming/kali-rolling/amd64/t/thunderbird/332825 thunderbird -- qemu /var/lib/debci/qemu/kali-rolling-amd64.img
           ├─ 5007 tee /var/lib/debci/tmp/autopkgtest-fifo._ww_6t7s -a /dev/stderr
           ├─ 5010 cat /var/lib/debci/tmp/autopkgtest-fifo._ww_6t7s
           ├─10128 amqp-consume --url amqp://melpomene.kali.org -q debci-tests-amd64-qemu --prefetch-count 1 -- /usr/share/debci/bin/debci-worker --do-request
           └─12103 /usr/bin/python3 /var/lib/debci/tmp/autopkgtest-qemu.8b256xg_/runcmd sh -ec cd /tmp/autopkgtest.ZCpmjh/build.QA0/src/; tar --warning=none -c . -f -

Jun 17 12:21:05 melpomene.kali.org debci[10128]: libreswan kali-dev/amd64 fail
Jun 17 12:21:05 melpomene.kali.org debci[10128]: octave-vibes kali-rolling/amd64 started
Jun 17 12:23:13 melpomene.kali.org debci[10128]: octave-vibes kali-rolling/amd64 pass
Jun 17 12:23:13 melpomene.kali.org debci[10128]: liblog-agent-perl kali-dev/amd64 started
Jun 17 12:25:42 melpomene.kali.org debci[10128]: liblog-agent-perl kali-dev/amd64 pass
Jun 17 12:25:42 melpomene.kali.org debci[10128]: libnet-https-any-perl kali-dev/amd64 started
Jun 17 12:28:15 melpomene.kali.org debci[10128]: libnet-https-any-perl kali-dev/amd64 pass
Jun 17 12:28:15 melpomene.kali.org debci[10128]: libbusiness-onlinepayment-transactioncentral-perl kali-rolling/amd64 started
Jun 17 12:30:48 melpomene.kali.org debci[10128]: libbusiness-onlinepayment-transactioncentral-perl kali-rolling/amd64 pass
Jun 17 12:30:48 melpomene.kali.org debci[10128]: thunderbird kali-rolling/amd64 started

(at this point, the worker has been stuck on thunderbird for 2 days!)
---------

Extract of ps auxf :

debci    10128  0.0  0.0  24436  3140 ?        Ss   Jun07   0:00 amqp-consume --url amqp://melpomene.kali.org -q debci-tests-amd64-qemu --prefetch-count 1 -- /usr/share/debci/bin/debci-worker --do-request
debci     4866  0.0  0.0  13104  3124 ?        S    Jun17   0:00  \_ /bin/sh /usr/share/debci/bin/debci-worker --do-request
debci     4916  0.0  0.0  13104  2216 ?        S    Jun17   0:00      \_ /bin/sh /usr/share/debci/bin/debci-worker --do-request
debci     4918  0.0  0.0  13096  3276 ?        S    Jun17   0:00          \_ /bin/sh /usr/share/debci/bin/debci-test --quiet --data-dir /var/lib/debci/tmp/tmp.xAIA5SX5eT --suite kali-rolling --print-output --run-id=332825 thunderbird
debci     4979  0.0  0.0  13096  2008 ?        S    Jun17   0:00              \_ /bin/sh /usr/share/debci/bin/debci-test --quiet --data-dir /var/lib/debci/tmp/tmp.xAIA5SX5eT --suite kali-rolling --print-output --run-id=332825 thunderbird
debci     4981  0.0  0.0  12916  2680 ?        S    Jun17   0:00                  \_ /bin/sh /usr/share/debci/backends/qemu/test-package --output-dir /var/lib/debci/tmp/tmp.xAIA5SX5eT/autopkgtest-incoming/kali-rolling/amd64/t/thunderbird/332825 thunderbird
debci     4985  0.0  0.0  12912  2668 ?        S    Jun17   0:00                      \_ /bin/sh /usr/share/debci/bin/debci-autopkgtest --user debci --output-dir /var/lib/debci/tmp/tmp.xAIA5SX5eT/autopkgtest-incoming/kali-rolling/amd64/t/thunderbird/332825 thunderbird -- qemu /var/lib/debci/qemu/kali-rolling-amd64.img
debci     4986  0.0  0.0  71176 21328 ?        S    Jun17   0:02                          \_ /usr/bin/python3 -u /usr/bin/autopkgtest --no-built-binaries --user debci --output-dir /var/lib/debci/tmp/tmp.xAIA5SX5eT/autopkgtest-incoming/kali-rolling/amd64/t/thunderbird/332825 thunderbird -- qemu /var/lib/debci/qemu/kali-rolling-amd64.img
debci     5007  0.0  0.0   7580   656 ?        S    Jun17   0:01                              \_ tee /var/lib/debci/tmp/autopkgtest-fifo._ww_6t7s -a /dev/stderr
debci     5010  0.0  0.0   7716   668 ?        S    Jun17   0:01                              \_ cat /var/lib/debci/tmp/autopkgtest-fifo._ww_6t7s
debci     5014  0.0  0.0      0     0 ?        Z    Jun17   0:00                              \_ [autopkgtest-vir] <defunct>

There's another PID which likely lost its parent and now appears a child of init:
debci    12103  0.6  0.0 272020 19768 ?        Sl   Jun17  17:33 /usr/bin/python3 /var/lib/debci/tmp/autopkgtest-qemu.8b256xg_/runcmd sh -ec cd /tmp/autopkgtest.ZCpmjh/build.QA0/src/; tar --warning=none -c . -f -
----------

Logs of autopkgtest:

autopkgtest [12:30:49]: version 5.10
autopkgtest [12:30:49]: host melpomene.kali.org; command line: /usr/bin/autopkgtest --no-built-binaries --user debci --output-dir /var/lib/debci/tmp/tmp.xAIA5SX5eT/autopkgtest-incoming/kali-rolling/amd64/t/thunderbird/332825 thunderbird -- qemu /var/lib/debci/qemu/kali-rolling-amd64.img
warning: host doesn't support requested feature: CPUID.01H:ECX.vmx [bit 5]
autopkgtest [12:31:06]: testbed dpkg architecture: amd64
autopkgtest [12:31:08]: testbed running kernel: Linux 4.19.0-kali5-amd64 #1 SMP Debian 4.19.37-2kali1 (2019-05-15)
autopkgtest [12:31:09]: @@@@@@@@@@@@@@@@@@@@ apt-source thunderbird
Get:1 http://kali.download/kali kali-rolling/main thunderbird 1:60.7.0-1 (dsc) [12.2 kB]
Get:2 http://kali.download/kali kali-rolling/main thunderbird 1:60.7.0-1 (tar) [956 kB]
Get:3 http://kali.download/kali kali-rolling/main thunderbird 1:60.7.0-1 (tar) [9266 kB]
Get:4 http://kali.download/kali kali-rolling/main thunderbird 1:60.7.0-1 (tar) [287 MB]
Get:5 http://kali.download/kali kali-rolling/main thunderbird 1:60.7.0-1 (diff) [555 kB]
gpgv: unknown type of key resource 'trustedkeys.kbx'
gpgv: keyblock resource '/home/debci/.gnupg/trustedkeys.kbx': General error
gpgv: Signature made Thu May 23 16:01:26 2019 UTC
gpgv:                using RSA key B70DFC6F134FECFC011E62AA83016014251D1DB0
gpgv: Can't check signature: No public key
dpkg-source: warning: failed to verify signature on ./thunderbird_60.7.0-1.dsc
autopkgtest [12:36:07]: testing package thunderbird version 1:60.7.0-1
autopkgtest [12:36:07]: build not needed
tar: ./testing/web-platform/tests/websockets/constructor/020.html: Cannot open: No space left on device
tar: ./testing/web-platform/tests/websockets/constructor/004.html: Cannot open: No space left on device
[...]
tar: ./docshell: Cannot mkdir: No space left on device
tar: ./docshell/build/nsDocShellCID.h: Cannot open: No such file or directory
tar: Exiting with failure status due to previous errors
qemu-system-x86_64: terminating on signal 15 from pid 5014 (/usr/bin/python3)
<VirtSubproc>: failure: copyup destination failed, status 2
autopkgtest [12:36:57]: ERROR: testbed failure: cannot send to testbed: [Errno 32] Broken pipe

Cheers,
  Raphaël.

#930719#10
Date:
2019-06-28 07:31:10 UTC
From:
To:
That diagnostic was wrong. The size of the image was not the problem.
In fact the failing "copyup" was copying from the VM back into the host.
We use a "tmpfs" for TMPDIR and /var/lib/debci/tmp so that all autopkgtest
files get copied there (qemu overlay and everything)... but that tmpfs was
badly sized:
# grep debci /etc/fstab
tmpfs		/srv/debci/tmp 	tmpfs	size=300G,nr_inodes=300k,rw,uid=debci,gid=debci,mode=755 0 0

The 300k inodes was too small, thunderbird alone has 327000 files in its
source package. Thus the copy operation was failing. Still, it would be
nice if autopkgtest could avoid to get stuck in such situations...

Cheers,

#930719#15
Date:
2024-09-12 12:33:43 UTC
From:
To:
retitle 930719 Hangs with qemu backend when copyup target partition is full
severity 930719 normal

#930719#20
Date:
2025-09-04 18:06:30 UTC
From:
To:
control: retitle -1 autpkgtest: hangs when copyup fails (with qemu)
control: severity -1 normal

Hi,

By design, autopkgtest tries to put all actions behind one of the
different flavors of timeout (autopkgtest has multiple of those), such
that issues like shouldn't happen. Since commit bdc545d (mid 2023) we at
least have a global timeout that should prevent autopkgtest from hanging
indefinitely.

lib/VirtSubproc.py (which seems to be the culprit here) has seen quite a
lot of changes since version 5.10 (2019), so reproducing this issue is
probably the first step to see if it still exists in the current code.

Paul