#1012318 diffoscope 214 produced no output and was killed after running into timeout after 150m #1012318
- Package:
- diffoscope
- Source:
- diffoscope
- Submitter:
- Roman Lebedev
- Date:
- 2022-06-08 13:54:03 UTC
- Severity:
- normal
Dear maintainer, it would appear, diffoscope is failing when runing agains halide package: https://tests.reproducible-builds.org/debian/rb-pkg/unstable/amd64/diffoscope-results/halide.html Unfortunately, i do not know what the more specific problem is. Roman. - -- System Information: Debian Release: bookworm/sid APT prefers unstable APT policy: (990, 'unstable'), (500, 'unstable-debug'), (1, 'experimental-debug'), (1, 'experimental') Architecture: amd64 (x86_64) Kernel: Linux 5.17.0-3-amd64 (SMP w/32 CPU threads; PREEMPT) Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8), LANGUAGE=en_US Shell: /bin/sh linked to /usr/bin/dash Init: systemd (via /run/systemd/system) LSM: AppArmor: enabled Versions of packages diffoscope depends on: pn diffoscope-minimal <none> Versions of packages diffoscope recommends: pn abootimg <none> ii acl 2.3.1-1 pn androguard <none> pn apksigner <none> pn apktool <none> pn binutils-multiarch <none> ii bzip2 1.0.8-5 pn caca-utils <none> pn colord <none> pn db-util <none> pn default-jdk-headless | default-jdk | java-sdk <none> pn device-tree-compiler <none> pn docx2txt <none> ii e2fsprogs 1.46.5-2 pn enjarify <none> ii ffmpeg 10:4.4.2-dmo5 pn fontforge-extras <none> pn fp-utils <none> ii genisoimage 9:1.1.11-3.2 ii gettext 0.21-6 pn ghc <none> ii ghostscript 9.56.1~dfsg-1 pn giflib-tools <none> pn gnumeric <none> ii gnupg 2.2.35-2 ii gnupg-utils 2.2.35-2 pn hdf5-tools <none> ii imagemagick-6.q16 [imagemagick] 8:6.9.11.60+dfsg-1.3+b2 pn jsbeautifier <none> ii libarchive-tools 3.6.0-1 pn libxmlb-dev <none> ii llvm 1:15.0-55~20220203224054.8 ii lz4 [liblz4-tool] 1.9.3-2 pn mono-utils <none> pn ocaml-nox <none> pn odt2txt <none> pn oggvideotools <none> ii openssh-client 1:9.0p1-1+b1 ii openssl 3.0.3-5 pn pgpdump <none> ii poppler-utils 22.02.0-3 pn procyon-decompiler <none> pn python3-argcomplete <none> pn python3-binwalk <none> ii python3-debian 0.1.44 ii python3-defusedxml 0.7.1-1 pn python3-guestfs <none> pn python3-jsondiff <none> pn python3-pdfminer <none> pn python3-progressbar <none> pn python3-pypdf2 <none> pn python3-pyxattr <none> pn python3-rpm <none> pn python3-tlsh <none> pn r-base-core <none> pn radare2 <none> ii rpm2cpio 4.17.0+dfsg1-4 pn sng <none> ii sqlite3 3.38.5-1 pn squashfs-tools <none> pn tcpdump <none> pn u-boot-tools <none> ii unzip 6.0-26 ii vim-common 2:8.2.4793-1 pn wabt <none> pn xmlbeans <none> ii xxd 2:8.2.4793-1 ii xz-utils 5.2.5-2.1 ii zip 3.0-12 ii zstd 1.5.2+dfsg-1 Versions of packages diffoscope suggests: ii libjs-jquery 3.6.0+dfsg+~3.5.13-1 -----BEGIN PGP SIGNATURE----- iQIzBAEBCgAdFiEE+UikBxeiu50LOdFYgbqkFMWfZdAFAmKahZ0ACgkQgbqkFMWf ZdCizhAAxer8KRHDg7wmjBNf9zK1HH4Oyf/KfRLHW22JsZlMhkKjp449H6E3R2q5 mAlyGE/5L9j6mGFRhxZaGWqxCin54EA0lPZzDgIpHIpLeYZbBP9QUASI3nj+0IfX 5yTkS0vz2SNVWt7/GxNBGUb19vu3fb/jounZyCiWLrP5cU4QhM/nEOgdcvvyllTH kggEARqBfSQDk0bgkeh8yO3NW6swp97BbP18K6LdnPmKUftSHChePjQo27a0OltZ ilO4gRym4AIlUTsDsZ98a695Hux5SKmn7Lsok0oYdBumge2cHeFvb7X5FLOwriv6 OnSxjR4sKQROITRjp2/cqijzIsRqBd2SN09yXvOKusZn0yIjTDgDer5UQfzhesFX NSDyTJgqR5FD5E6f5NaZPl9paw1+c7sgtGEN8bIv4kploeHgME23HKQKxVHt18o0 1AbCKDvcJBt4M7xksYrz4nfRa4XyRzpbEm0oYzQutgZL/+Osr/y9tdaNYXsSYJMu o9nJcnb5J+jc3JBIYDdTz8p4xgvyq0aPx1vOBZM6/E/q13EQRhF03DTGWAWK3ext kJMR4HasD3J22YqgJ61S02mYTHnaLezgHSMJbRKm04tMiOjXUYtwOljpNmjiDJAy xYoBr5KGeAUvKv58InsQY04pSxivjBuBHl563JQP7zvd85TKYCU= =lBAV -----END PGP SIGNATURE-----
Some things just take a long time to run diffoscope, and tests.reproducible-builds.org only lets diffoscope run for 150m and then kills it to avoid very large unreproducible packages from eating all the resources. Options would be to run diffoscope on two builds of halide locally. The tool reprotest does this and runs diffoscope by default if it is installed. There is also beta.tests.reproducible-builds.org which may not have a diffoscope timeout, so you might be able to find diffoscope output for halide there. There may be improvements that could make diffoscope run faster and avoid the timeout, though I'm not sure this is a fundamental problem of diffoscope. I'll see if I can find you some diffoscope output for halide by running it on a fairly powerful machine I have access to... live well, vagrant
... https://people.debian.org/~vagrant/reproducible/halide.diffoscope.out.zst It's under 200MB download, expands to ~2GB uncompresed... live well, vagrant
Vagrant Cascadian wrote: (I do see that Vagrant has sent a diff in a subsequent email in this thread...) For the benefit of Roman, there are actually two levels of timeouts involved when running diffoscope on tests.reproducible-builds.org: First, we pass a value to diffoscope's --timeout option:
Nevertheless, it could get stuck on some arbitrary file, no? Or does it
actually try to abort processing a file if it is taking too long?
Yeah, I think maybe we should add more time to the hard jenkins timeout
to make sure it really gets *some* output.
The original was just the defaults that reprotest uses (perhaps we
should add --profile?).
I re-ran diffoscope with profile output using the same artifacts....
# Profiling output for: /usr/bin/diffoscope --progress --profile --text=halide.diffoscope control/ experiment-1
## close_archive (total time: 0.001s)
0.000s 24 calls diffoscope.comparators.xz.XzContainer
0.000s 22 calls diffoscope.comparators.tar.TarContainer
0.000s 6 calls diffoscope.comparators.deb.DebContainer
0.000s 10 calls diffoscope.comparators.deb.DebTarContainer
## command (total time: 1816.184s)
805.925s 8949 calls readelf
770.865s 2011 calls diff
232.245s 1188 calls objdump
5.085s 36 calls xz
2.002s 198 calls strings
0.026s 2 calls cmp
0.026s 2 calls cmp (external)
0.004s 54 calls cmp (internal)
0.003s 66 calls stat
0.002s 18 calls objcopy
## compare_files (cumulative) (total time: 3904.923s)
922.891s 6 calls abc.DebFile
922.341s 12 calls abc.XzFile
916.800s 6 calls abc.DebDataTarFile
873.202s 18 calls abc.ElfFile
183.319s 68 calls diffoscope.comparators.elf.ElfCodeSection
46.212s 338 calls diffoscope.comparators.elf.ElfSection
29.528s 1047 calls abc.TextFile
10.056s 34 calls diffoscope.comparators.elf.ElfStringSection
0.446s 6 calls abc.TarFile
0.126s 6 calls abc.Md5sumsFile
0.000s 1 call diffoscope.comparators.directory.FilesystemDirectory
## container_extract (total time: 6.318s)
5.089s 36 calls diffoscope.comparators.xz.XzContainer
1.109s 6388 calls diffoscope.comparators.deb.DebTarContainer
0.098s 48 calls diffoscope.comparators.deb.DebContainer
0.022s 40 calls diffoscope.comparators.tar.TarContainer
## has_same_content_as (total time: 0.109s)
0.071s 1057 calls abc.TextFile
0.016s 9 calls abc.DebFile
0.013s 6 calls abc.DebDataTarFile
0.003s 338 calls diffoscope.comparators.elf.ElfSection
0.002s 18 calls abc.ElfFile
0.001s 12 calls abc.Md5sumsFile
0.001s 12 calls abc.XzFile
0.001s 6 calls abc.TarFile
0.001s 68 calls diffoscope.comparators.elf.ElfCodeSection
0.000s 3 calls diffoscope.comparators.utils.libarchive.LibarchiveSymlink
0.000s 34 calls diffoscope.comparators.elf.ElfStringSection
0.000s 1 call diffoscope.comparators.directory.FilesystemDirectory
## main (total time: 935.422s)
935.417s 2 calls outputs
0.005s 1 call cleanup
## open_archive (total time: 0.001s)
0.000s 36 calls diffoscope.comparators.xz.XzContainer
0.000s 24 calls diffoscope.comparators.tar.TarContainer
0.000s 12 calls diffoscope.comparators.deb.DebTarContainer
0.000s 12 calls diffoscope.comparators.deb.DebContainer
## output (total time: 12.131s)
12.131s 1 call text
## recognizes (total time: 8.809s)
7.074s 51970 calls diffoscope.comparators.utils.libarchive.LibarchiveMember
1.111s 7436 calls diffoscope.comparators.elf.ElfSection
0.240s 1496 calls diffoscope.comparators.elf.ElfCodeSection
0.221s 1488 calls diffoscope.comparators.utils.archive.ArchiveMember
0.119s 748 calls diffoscope.comparators.elf.ElfStringSection
0.043s 468 calls diffoscope.comparators.binary.FilesystemFile
0.001s 96 calls abc.Md5sumsFile
0.000s 12 calls diffoscope.comparators.utils.libarchive.LibarchiveSymlink
## specialize (total time: 9.476s)
9.476s 1564 calls specialize
And again for html generation:
# Profiling output for: /usr/bin/diffoscope --progress --profile --html halide.diffoscope.out.html control/ experiment-1
## close_archive (total time: 0.001s)
0.000s 24 calls diffoscope.comparators.tar.TarContainer
0.000s 36 calls diffoscope.comparators.xz.XzContainer
0.000s 12 calls diffoscope.comparators.deb.DebContainer
0.000s 12 calls diffoscope.comparators.deb.DebTarContainer
## command (total time: 1784.837s)
784.955s 8949 calls readelf
757.622s 2011 calls diff
235.317s 1188 calls objdump
4.824s 36 calls xz
2.056s 198 calls strings
0.027s 2 calls cmp
0.027s 2 calls cmp (external)
0.004s 54 calls cmp (internal)
0.003s 66 calls stat
0.002s 18 calls objcopy
## compare_files (cumulative) (total time: 3866.521s)
915.357s 6 calls abc.DebFile
914.809s 12 calls abc.XzFile
909.520s 6 calls abc.DebDataTarFile
856.681s 18 calls abc.ElfFile
184.384s 68 calls diffoscope.comparators.elf.ElfCodeSection
46.102s 338 calls diffoscope.comparators.elf.ElfSection
29.150s 1047 calls abc.TextFile
9.956s 34 calls diffoscope.comparators.elf.ElfStringSection
0.440s 6 calls abc.TarFile
0.123s 6 calls abc.Md5sumsFile
0.001s 1 call diffoscope.comparators.directory.FilesystemDirectory
## container_extract (total time: 6.051s)
4.827s 36 calls diffoscope.comparators.xz.XzContainer
1.104s 6388 calls diffoscope.comparators.deb.DebTarContainer
0.096s 48 calls diffoscope.comparators.deb.DebContainer
0.023s 40 calls diffoscope.comparators.tar.TarContainer
## diff (total time: 15.681s)
15.681s 7207 calls linediff
## has_same_content_as (total time: 0.109s)
0.070s 1057 calls abc.TextFile
0.016s 6 calls abc.DebDataTarFile
0.014s 9 calls abc.DebFile
0.003s 338 calls diffoscope.comparators.elf.ElfSection
0.002s 18 calls abc.ElfFile
0.001s 12 calls abc.Md5sumsFile
0.001s 12 calls abc.XzFile
0.001s 6 calls abc.TarFile
0.001s 68 calls diffoscope.comparators.elf.ElfCodeSection
0.000s 3 calls diffoscope.comparators.utils.libarchive.LibarchiveSymlink
0.000s 34 calls diffoscope.comparators.elf.ElfStringSection
0.000s 1 call diffoscope.comparators.directory.FilesystemDirectory
## main (total time: 1329.326s)
1329.326s 2 calls outputs
0.000s 1 call cleanup
## open_archive (total time: 0.001s)
0.000s 36 calls diffoscope.comparators.xz.XzContainer
0.000s 24 calls diffoscope.comparators.tar.TarContainer
0.000s 12 calls diffoscope.comparators.deb.DebTarContainer
0.000s 12 calls diffoscope.comparators.deb.DebContainer
## output (total time: 413.571s)
413.571s 1 call html
## recognizes (total time: 8.921s)
7.049s 51970 calls diffoscope.comparators.utils.libarchive.LibarchiveMember
1.216s 7436 calls diffoscope.comparators.elf.ElfSection
0.255s 1496 calls diffoscope.comparators.elf.ElfCodeSection
0.224s 1488 calls diffoscope.comparators.utils.archive.ArchiveMember
0.128s 748 calls diffoscope.comparators.elf.ElfStringSection
0.047s 468 calls diffoscope.comparators.binary.FilesystemFile
0.001s 96 calls abc.Md5sumsFile
0.000s 12 calls diffoscope.comparators.utils.libarchive.LibarchiveSymlink
## specialize (total time: 9.574s)
9.574s 1564 calls specialize
Just mentioning that we already call diffoscope with --profile on jenkins, just that if it gets killed hard it won't print it out. Furthermore, there are 6 minutes (i.e. 0.1 hours) between SIGTERM and SIGKILL. I don't know at which stage of this case diffoscope was terminated, but perhaps it's one side that could be investiage: to make diffoscope output something on SIGTERM when it still has a chance to continue running. vs So it's not really the presenter that takes excessively. But otoh 413 seconds are more than 6 minutes. If somebody looks into the point above I can change jenkins to wait 12 minutes (or 18) before KILLing after the TERM signal.
Vagrant Cascadian wrote: You are right — diffoscope could get stuck processing some arbitrary file and --timeout does not cause diffoscope abort halfway through processing a file (or, more likely, a call to /usr/bin/readelf!) if it's taking too long. I have tried to make this pretty clear in the `--help` output as well as clarifying that in any Salsa issue involving the --timeout flag but it still seems to be causing some confusion, alas. :) If it's okay with you, I'll save an explanation of why this would be difficult to implement within diffoscope for another venue, and not in this (slightly) unrelated Debian bug report. Either way, getting stuck on a particular file does not appear to be the problem in this case. Assuming that the issue is not specific to Jenkins' build of halide, you appear to be able to run it successfully to the end... Best wishes,
Mattia Rizzolo wrote: I'm not quite sure what connection you are making between the signal processing within diffoscope and HTML output generation. When diffoscope receives a TERM signal, it only attempts to clean up temporary files, it does not attempt to try and write any HTML output. See sigterm_handler in diffoscope/main.py, etc.. (I'm not entirely sure it should attempt to do so either. Signals are weird.) We could certainly try and print the profiling output if we receive a TERM though... aaaand I've pushed something to that effect to master. I've also added some logging warnings so it's much clearer if we are in that state. I'm +1 for increasing the hard upper limit on Jenkins and could be convinced to increase this TERM/KILL gap too, but as I mention above, I'm not sure we're following the same rationale yet. Regards,
Thank you for reminding me what the handler was doing! I wrote my email yesterday without actually looking, I just vaguely remembered it was doing _something_! Oh, yes, that's probably what it ought to do indeed (logging that it received a signal and leave some traces of itself on stdout, at least). Before raising the KILL limit diffoscope should learn how to treat TERM (and/or probably other signals too?) in a way that makes it at least terminating its current task (i.e. killing subprocesses, and start to render the output). But indeed, should diffoscope really do that? Overall, I agree we don't have a bug here.
Hey Mattia, Hm, by "leave some traces of itself on stdout", what exactly do you mean by that? You appear to be implying it should do more than simply logging it has received a signal and is cleaning up after itself? Hmm. Okay, let me spell out some of the downsides of doing this, if only as a way of explaining them back to myself (!) and seeing how much of a problem they really are: * It's unclear whether this fits the semantics of the TERM signal. As you yourself ask in your reply, it is unclear whether diffoscope *should* actually do this. (If I were on the command-line and hit CTRL+C, I'm not entirely sure I'd want it to stop performing a diff and start opening files to write HTML...) * Writing "signal-safe" code is fraught with weirdness, and this is compounded by us using a scripting language that partly abstracts away signal handling. (In other words, *could* we reliably do this? I'm not sure.) * Related to the previous point, the code paths for this particular mode would be difficult to test, and even if we had good testing coverage, they would not be exercised very often in real life. Re-creating issues arising from supporting this would likely be almost impossible — something breaking on Jenkins would be difficult to reproduce locally. * Things like subprocesses should be reaped anyway, at least according to my understanding about how processes work. * It's all a bit of a workaround to "diffoscope being slow". Or, rather, there are higher-priority things that will avoid us reaching the timeout in the first place. (For instance, improving the overall speed of ELF handling.) (Sorry for the sloppy braindump; again, it's not intended as a rejoinder to what you wrote...) Best wishes,
Sorry for being unclear. I meant that what you commit is exactly what it should have done from the start. Currently (before your commits) diffoscope would not print anything out after receiving TERM, save from some .debug() output about cleaning up temp files (which are not printed by default). Right. That's exactly why I also mentioned other signals, which might have better semantics for this case (like, HUP, which nowadays is pretty much process-specific). (besides, doesn't CTRL+C send INT, not TERM?) But I agree with this, so my musing above are probably better discarded at this time, and consider this bug closed :)
Mattia Rizzolo wrote: (Yes, whoops, you are absolutely right.) Yes, I should say so. Please go ahead... Best wishes,
Thus, closing. Chris committed a couple of commits to have a nicer log on the console when diffoscope is killed, but regardless of that, that diffoscope was killed after a timeout on jenkins is not an intrisic bug of diffoscope.