Somewhere between 1.20160123 (debian stretch) and 1.20180726, myrepos
acquired this strange behavior that window change signals
(SIGWINCH/28?) seem to crash the program completely, when running in
paralell (-j) and minimal (-m).
A simple reproducer is to start it the program on a large collection
(so it takes some time):
mr -m -j8 update -q
then resize the window. mr will exit this way:
$ mr -m -j8 update -q
mr update: received signal 1
This happens when mr is started from xterm in a tiling window manager
(which immediately resizes the window):
xterm -hold -e mr -m -j8 update -q
... although in my experience this is not always fully
reproducible. Better to resize the window by hand, as the signal
might be fired before mr actually starts running (my guess).
Super strange, and breaks my sync scripts so it's really annoying. :)
But I guess it's not release critical...
I've seen this before too but can't reproduce it easily. Would you mind bisecting to find the commit that causes it?
[2]anarcat@curie:myrepos((57b5fa2...)|BISECTING)$ git bisect good
15e7a50ac16554a8eafe659cc28621fcbb388e5a is the first bad commit
commit 15e7a50ac16554a8eafe659cc28621fcbb388e5a
Author: Paul Wise <pabs3@bonedaddy.net>
Date: Tue Feb 13 20:45:20 2018 +0800
Use $pty->getline and $pty->read as loop terminator instead of $pty->is_active
If the command writes some data, waits a bit, writes some more data and then
exits quickly then not all of the output will be read because a small amount
of data will be read in the first loop while the process is still running,
then the process will have exited and the loop will terminate,
leaving the remaining output unread.
Using the read data as loop terminator ensures that it will all be read.
Found-by: diff -u <(pipetty mr -m status) <(pipetty mr -m status)
:100755 100755 b5495f3029a050ddd6930bb2b9860b1ec324bcb6 1849f0094b65061fed50be24079447b8244a03c9 M mr
I hope that helps.
A.
Thanks for that. That commit fixes a pretty major bug, so I think that maybe the issue was present all along but masked by not using the right loop terminator. I think it would be interesting to do the bisect again but for each commit, cherry-pick the loop terminator fix before testing.
Ouch? Any practical way to do that? It's kind of annoying there's no obvious way to automate the reproducer. I tried to start it in tmux and it sometimes work, but not reliably enough to automatically bisect... And how would i reliable backport that patch systematically within bisect anyways? Thanks for any pointers... a.
I think a normal `git bisect` but running `git cherry-pick` before doing tests should work, untested though. Looks like it applies fine with `git cherry-pick` after checking out 1.20160123 so I suspect it will work fine with all the intermediary commits anyway. To determine if the commit is included in the current commit during the bisect, you can just `git grep is_active` or ask git with the `git merge-base --is-ancestor 15e7a50 HEAD` exit code. If the commit is already included, don't run `git cherry-pick`.
Okay, so here's an interesting data point. On Debian stretch, I can't reproduce the bug, even when running from git, using on current master (1.20180726-30-g6cf8003). So maybe something else is going on here, outside of myrepos itself... I'll try to reproduce with the cherry-picking when I get back home on that buster machine later. A.
Did you manage to work out how to bisect this?
Did you manage to figure out the stretch/buster difference?
I see that libio-pty-perl libio-pty-easy-perl are the same version.
I've found a workaround:
my $pty = IO::Pty::Easy->new(handle_pty_size => 0);
After editing the IO::Pty::Easy code I found that when IO::Pty::Easy
sets $SIG{WINCH} then the error happens, even if the function it sets
as SIGWINCH handler does nothing at all.
I managed to get a little bit further in the bisect using the attached script but I hit a point where the commit isn't easily backportable, I'll try to continue the backporting another day, spent too long today.
I thought about it a bit more logically and I went back to the first commit that added support for using IO::Pty::Easy to supervise the output of the command and added a new commit to make it just use read instead of read and is_active. The result was that I could reproduce the bug even with that commit. So when running on bullseye the issue appears to have been there from the beginning. I guess the next step would be to bisect the dependencies between stretch and buster, so I guess that is basically Perl only?
Control: tag -1 patch
Hi, could you please recheck this data point? I'm seeing the buggy
behaviour on stretch too with mr from current sid (1.20180726, just
copying the sid version of /usr/bin/mr onto stretch system).
Maybe you didn't have libio-pty-easy-perl installed on the stretch system
as that would silently mask the issue?
As far as I can see, the problem here is that the
while (my $data = $pty->read()) { ... }
loop assumes EOF when $data is false, and doesn't check for undef.
The IO::Pty::Easy docs say for read():
Returns "undef" on timeout, the empty string on EOF, or a string
of at least one character on success (this is consistent with
"sysread()" and Term::ReadKey).
When SIGWINCH interrupts the blocking read(), mr proceeds to close the pty,
and the child processes die from SIGHUP.
So this doesn't look to me like a regression in Perl or other
dependencies, just a bug in the code.
Maybe something like the attached patch would do? Only lightly tested
but seems to fix it for me.
Control: tags -1 + fixed-upstream Control: forwarded -1 http://source.myrepos.branchable.com/?p=source.git;a=commitdiff;h=0d8dc60c227cf5868e568c7fa30ebec1f770d521 Untested, but that could definitely explain it. I guess that since I didn't set a timeout I assumed the read call would only return either EOF or data not undef. Any idea why something similar doesn't also happen in the loop above, where getline would return undef on error, terminating the loop? That works for me, updated commit message trailer, applied upstream.
getline() has a different interface and restarts interrupted read(2) calls internally. It's like <> vs. sysread().
This seems to work here, thanks! a.