#1011646 libthrust: autopkgtest: please be more gentle on ci.d.n infrastructure

#1011646#5
Date:
2022-05-25 20:54:20 UTC
From:
To:
Dear maintainer,

I was checking what was happening on our infrastructure as I was
seeing degraded performance on several architectures, including
several hosts running out of disk space and even one VM that hung. I
don't have solid evidence that it's all caused by libthrust, but the
results on amd64, arm64 and ppc64el don't inspire confidence that this
package is entirely "innocent".

Please consider making your test suite much less intense. Looking at
the stats [1] of our big amd64 worker, it really looks like the
test was stressing it so much that we were building up a backlog of
tests, which rarely happens on amd64. Your test on amd64 [2] took 12
hours to come to a "neutral" conclusion because 4 of them timed out
(but marked flaky) and all others failed (while marked flaky) or
passed while marked superficial. That's a poor result for such an
extreme test.

On arm64 and ppc64el your tests seem to tmpfail. I am *suspecting*
that is because they run out of disk space. All our arm64 and ppc64el
workers have 40 GB disk and run two debci instances in parallel.

For now, I have put libthrust on our rejectlist for those three
architectures and I just flushed the amd64 queue because there were
several libthrust tests scheduled and we lack the facilities to remove
individual tests from the queue.

Thanks for using our facilities, but unfortunately we can't support
the tests in their current form.

Paul

[1] https://ci.debian.net/munin/ci-worker13/ci-worker13/index.html

[2] https://ci.debian.net/data/autopkgtest/testing/amd64/libt/libthrust/22073748/log.gz

#1011646#10
Date:
2022-05-27 04:49:21 UTC
From:
To:
We believe that the bug you reported is fixed in the latest version of
libthrust, which is due to be installed in the Debian FTP archive.

A summary of the changes between this version and the previous one is
attached.

Thank you for reporting the bug, which will now be closed.  If you
have further comments please address them to 1011646@bugs.debian.org,
and the maintainer will reopen the bug report if appropriate.

Debian distribution maintenance software
pp.
Andreas Beckmann <anbe@debian.org> (supplier of updated libthrust package)

(This message was generated automatically at their request; if you
believe that there is a problem with it please contact the archive
administrators by mailing ftpmaster@ftp-master.debian.org)
Format: 1.8
Date: Fri, 27 May 2022 06:20:52 +0200
Source: libthrust
Architecture: source
Version: 1.15.0-4
Distribution: unstable
Urgency: medium
Maintainer: Debian NVIDIA Maintainers <pkg-nvidia-devel@lists.alioth.debian.org>
Changed-By: Andreas Beckmann <anbe@debian.org>
Closes: 1011646
Changes:
 libthrust (1.15.0-4) unstable; urgency=medium
 .
   * Reduce amount of autopkgtest tests.  (Closes: #1011646)
Checksums-Sha1:
 583f4c8b00d7ed98c2e26723cc30bfe4be8bd96a 2116 libthrust_1.15.0-4.dsc
 a44ad7257af07551789dd28576fb21c107bec7b3 7572 libthrust_1.15.0-4.debian.tar.xz
 d5df591f934cf9bbdf947404a626950609090d59 6284 libthrust_1.15.0-4_source.buildinfo
Checksums-Sha256:
 82cd1962124d04ec04790c7890712afb77496650325f99914c3cb25a1df7a6a1 2116 libthrust_1.15.0-4.dsc
 ec460ee5193357d12514f1d18ed8127ce4a92375886807479221d8566f5fcc20 7572 libthrust_1.15.0-4.debian.tar.xz
 9c1f42557bec5561f32d9121cf3b20e187821b99106db740d4f64d6fe485acb5 6284 libthrust_1.15.0-4_source.buildinfo
Files:
 c1df10cf6d6e2daf47e52c458c296bda 2116 libdevel optional libthrust_1.15.0-4.dsc
 adaabd0ac20b5117fc4f88122c0ca7b4 7572 libdevel optional libthrust_1.15.0-4.debian.tar.xz
 5d8732e4bb9262ecb852a66a86338e95 6284 libdevel optional libthrust_1.15.0-4_source.buildinfo
-----BEGIN PGP SIGNATURE-----

iQJEBAEBCAAuFiEE6/MKMKjZxjvaRMaUX7M/k1np7QgFAmKQVJEQHGFuYmVAZGVi
aWFuLm9yZwAKCRBfsz+TWentCLFBEACiXKB/fSjUZrdmlERRKdK3E1/lHk2mt45w
brjnt9Ld99jzWte0Y0hWHqmSZ7vG3aheC6hxAQtUvsEjZzbXX8vhPr5xo6RVPKgC
jLpg9WVtIirbRDoZhY4t43FIrTvydR8Hgcr1I8zZCY/F9HUnlakjnvjdFZW+Mm5P
rjGnhNOX0pIYcArRN68rg7FA4Agsx9kfZzN9pnbLG2I2yM1cytW5o5lvlvaml11v
XmjB0NMApIyT/MgfToT1CPEeYUvg9fMOqpw6rkTXiTxDc9wKYfuKxjc6883e+Y9K
16B4sGMzVp/x/e0Qfjb6s8TjchLQuRIpjQ/lK/ucOnAmkyj/wOuqmmQdHqJfm0Q6
ADdD6TZ62284t52ucHs1TP1t4TyokIwGhDIOpHxEyNd29sBiPIcixsiod2mV+SaQ
4cKseVW7Ckm7dWoPivJtSvuKFBXSO86DfhALfvYbgurYt5ocWNCinmEy9K92tpwE
JfrabZjYO3TJBagDTVBpaQPJMNkqxdj64HQPAHM8osnOny2dic+Od7FX3NLzQxNl
WqyFHxuIjCTd4iC18TGEdKAvfRU9yqz7TflIOKvpD8ZCINKFseRy0ybEQ/MrZwMo
eUS+izFH6+r40xe6kJdTBGwfiISBm+wKAYgxZZo5EGyGkuR0eptUt/o5HwBr1IB8
IiXqNczReQ==
=yfXy
-----END PGP SIGNATURE-----

#1011646#15
Date:
2022-05-27 15:17:21 UTC
From:
To:
Upstream comes with a rather exhaustive test matrix ...
I'm now running a smaller subset of it and in a more fine grained way to
make it easier to decide which parts to skip as well.
The OpenMP parts were even slower than on my machine ... so far I hadn't
seen individual tests time out in ctest. Synchronizing between 56
threads seems to be a hard job ;-)

With the more fine grained testing I've tried to free disk space at the
end of the smaller test chunks, not sure if that was successful.
probably starting with amd64, to see whether I managed to get it down to
an acceptable size.
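
The idea of freeing disk space at the end of each smaller test chunk
can be sketched roughly as follows. This is only an illustration, not
the actual libthrust test script: the chunk names, the run_in_chunks
helper and the fake_chunk stand-in are all hypothetical.

```python
import os
import tempfile


def run_in_chunks(chunk_names, run_chunk):
    """Run each chunk in its own scratch directory and delete it afterwards.

    `run_chunk` is a hypothetical callable taking (name, build_dir).
    """
    results = {}
    for name in chunk_names:
        # The directory and everything in it is removed when the block exits,
        # so only one chunk's build artifacts occupy the disk at a time.
        with tempfile.TemporaryDirectory(prefix="chunk-") as build_dir:
            results[name] = run_chunk(name, build_dir)
    return results


def fake_chunk(name, build_dir):
    """Stand-in for a real build/test step: writes a small artifact."""
    with open(os.path.join(build_dir, "artifact.bin"), "wb") as fh:
        fh.write(b"\0" * 1024)
    return "ok"


# Hypothetical chunk labels, for illustration only.
print(run_in_chunks(["CPP_C++14", "TBB_C++17"], fake_chunk))
```

With this layout the peak disk usage is bounded by the largest single
chunk rather than by the sum of all chunks.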

Hmm, a first test for -4 on amd64 has already finished (so the
blacklisting did not work?), mostly telling me 'SKIP test name may not
contain / character' (that should be checked by lintian). Preparing -5
now ...
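
A check like the following would have caught the invalid names before
upload. It is a minimal sketch that only covers the one rule quoted
above (no '/' in a test name); the helper name and the second sample
name are made up for illustration, and the real tools may enforce
further rules.

```python
def invalid_test_names(names):
    """Return the names that contain a '/' character."""
    return [name for name in names if "/" in name]


sample = [
    "run_testsuite_CPP_C++14",   # form used later in this report
    "run/testsuite/CPP/C++14",   # hypothetical name that would be skipped
]
print(invalid_test_names(sample))
```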

The tests are temporarily all on flaky to avoid introducing regressions
while testing the tests ;-)

The tests are all superficial, since we can't run (but only compile) the
most relevant part of the testsuite: the cuda tests.

Andreas

PS: src:cub needs to be trimmed down as well, that has done some 12 h
runs on ci-worker13, too ... not touching that before we have resolved
src:libthrust

#1011646#20
Date:
2022-05-27 15:20:39 UTC
From:
To:
Hi

I triggered that.

Paul

#1011646#25
Date:
2022-06-02 19:45:00 UTC
From:
To:
OK. Could you give -5 a try? That should have valid test names ...

Thanks.

BTW, nvidia-cuda-toolkit currently seems stuck:

autopkgtest for nvidia-cuda-toolkit/blocked-on-ci-infra: ppc64el: Regression

Is this related to the libthrust blocks?


Andreas

#1011646#30
Date:
2022-06-02 19:53:44 UTC
From:
To:
Hi

That has already happened automatically, as the block has been removed.

No, that's because it has its own block [1]:

nvidia-cuda-toolkit 	All 	ppc64el 	* 	test suite fails to start properly
(disk space in unstable)

Sorry, I forgot to file a bug about that.

[1] https://ci.debian.net/status/reject_list/

Paul

#1011646#35
Date:
2022-06-04 03:49:06 UTC
From:
To:
arm64 still times out on the cuda parts ... :-(
waiting for the ppc64el run ...

Can you hint against that?
That can't even be prevented with an Architecture setting.

I've now copied the autopkgtest from src:nvidia-cuda-toolkit to
src:pycuda - all we need is an installation of nvidia-cuda-toolkit, and
that is < 5 GB. If that works, I'll drop the autopkgtest from
src:nvidia-cuda-toolkit again.
(pycuda didn't have any tests, yet, (its testsuite would want to run
cuda code on a gpu), and the new test is fast enough to be run on salsa-ci)

Andreas

PS: I've updated src:cub and reduced the autopkgtest by 50%. It still
fails early on arm64/ppc64el (due to some char/uchar mess), I'll take
that upstream once I'm running on the latest upstream release.

#1011646#40
Date:
2022-06-04 05:18:47 UTC
From:
To:
Hi Andreas,

The ppc64el one finished while I was checking. It also timed out and
took in total 8 hours.

Yes, but only because of the explanation you gave below (too big source).

Hmm, I think that could be something that autopkgtest could check before
starting, apt knows about this, doesn't it?
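
Assuming the check meant here is "is there enough free disk space for
what is about to be installed", a pre-flight test could look roughly
like this. It is a sketch only: the has_enough_space helper is
hypothetical and not an existing autopkgtest feature, and the 5 GB
figure is taken from the "< 5 GB" estimate mentioned earlier in this
report.

```python
import shutil


def has_enough_space(path, required_bytes):
    """Return True if the filesystem holding `path` has enough free space."""
    return shutil.disk_usage(path).free >= required_bytes


# Assumed requirement: roughly 5 GB for the installation.
REQUIRED_BYTES = 5 * 1024**3

if has_enough_space(".", REQUIRED_BYTES):
    print("enough free space, starting test")
else:
    print("not enough free space, skipping test")
```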

Ack, hence not blaming nvidia-cuda-toolkit anymore.

Ack.

Paul

#1011646#45
Date:
2022-06-11 20:27:07 UTC
From:
To:
Hi Paul,

with the latest tuned autopkgtest versions of src:libthrust and src:cub
having migrated to testing, could you check whether these packages now
only put an "acceptable" load on the CI infrastructure?

(the src:nvidia-cuda-toolkit autopkgtest will be dropped with the next
upload, src:pycuda should have equivalent tests now while requiring no
extraordinary space to unpack :-)

Andreas

#1011646#50
Date:
2022-06-12 20:46:18 UTC
From:
To:
Hi Andreas,
that sense...

Both packages still only result in neutral (with flaky skips):
[cub]
cmake_find_package_CUB PASS (superficial)
compile_testsuite_cuda-g++_C++17 PASS (superficial)
compile_testsuite_cuda-g++_C++14 PASS (superficial)
compile_testsuite_g++-11_C++17 FLAKY non-zero exit status 2
compile_testsuite_g++-10_C++17 SKIP exit status 77 and marked as skippable
[libthrust]
cmake_find_package_Thrust PASS (superficial)
run_testsuite_CPP_C++17_g++-12 PASS (superficial)
run_testsuite_CPP_C++17_g++-11 FLAKY non-zero exit status 8
run_testsuite_CPP_C++17_g++-10 PASS (superficial)
run_testsuite_CPP_C++14 FLAKY non-zero exit status 8
run_testsuite_TBB_C++17 PASS (superficial)
compile_testsuite_CPP_CUDA_C++17_cuda-g++ PASS (superficial)
compile_testsuite_TBB_CUDA_C++17_cuda-g++ PASS (superficial)
compile_testsuite_CPP_CUDA_C++17_g++-11 FLAKY non-zero exit status 2
compile_testsuite_TBB_CUDA_C++17_g++-11 FLAKY non-zero exit status 2
compile_testsuite_CPP_CUDA_C++17_g++-10 SKIP exit status 77 and marked as skippable
compile_testsuite_TBB_CUDA_C++17_g++-10 SKIP exit status 77 and marked as skippable
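
To see why the libthrust list above only adds up to "neutral", the
results can be tallied with a few lines of Python. The overall()
rule here is a simplified model based on the description in this
report (flaky failures and superficial passes do not count), not the
actual debci implementation.

```python
from collections import Counter

# libthrust results copied from the list above.
RESULTS = """\
cmake_find_package_Thrust PASS (superficial)
run_testsuite_CPP_C++17_g++-12 PASS (superficial)
run_testsuite_CPP_C++17_g++-11 FLAKY non-zero exit status 8
run_testsuite_CPP_C++17_g++-10 PASS (superficial)
run_testsuite_CPP_C++14 FLAKY non-zero exit status 8
run_testsuite_TBB_C++17 PASS (superficial)
compile_testsuite_CPP_CUDA_C++17_cuda-g++ PASS (superficial)
compile_testsuite_TBB_CUDA_C++17_cuda-g++ PASS (superficial)
compile_testsuite_CPP_CUDA_C++17_g++-11 FLAKY non-zero exit status 2
compile_testsuite_TBB_CUDA_C++17_g++-11 FLAKY non-zero exit status 2
compile_testsuite_CPP_CUDA_C++17_g++-10 SKIP exit status 77 and marked as skippable
compile_testsuite_TBB_CUDA_C++17_g++-10 SKIP exit status 77 and marked as skippable
"""


def tally(text):
    """Count results per status, keeping superficial passes separate."""
    counts = Counter()
    for line in text.splitlines():
        parts = line.split(maxsplit=2)
        if len(parts) < 2:
            continue
        status = parts[1]
        detail = parts[2] if len(parts) > 2 else ""
        if status == "PASS" and "superficial" in detail:
            status = "PASS (superficial)"
        counts[status] += 1
    return counts


def overall(counts):
    """Simplified model: only hard failures and real passes are decisive."""
    if counts["FAIL"]:
        return "fail"
    if counts["PASS"]:
        return "pass"
    return "neutral"


counts = tally(RESULTS)
print(dict(counts), "->", overall(counts))
```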

That's a bit disappointing for a test that takes around 5 to 7 hours
(but better than before). Alas.

Please ping me when that future upload migrates, then I can drop my
entry in the reject list.

Paul

#1011646#55
Date:
2022-07-10 02:11:50 UTC
From:
To:
After yesterday's point release all nvidia-cuda-toolkit autopkgtests
should be gone from the archive. (buster-backports to experimental).


Andreas