#927154 aborts on window size change

#927154#5
Date:
2019-04-15 16:10:22 UTC
From:
To:
Somewhere between 1.20160123 (debian stretch) and 1.20180726, myrepos
acquired this strange behavior that window change signals
(SIGWINCH/28?) seem to crash the program completely, when running in
paralell (-j) and minimal (-m).

A simple reproducer is to start it the program on a large collection
(so it takes some time):

    mr -m -j8 update -q

then resize the window. mr will exit this way:

    $ mr -m -j8 update -q
    mr update: received signal 1

This happens when mr is started from xterm in a tiling window manager
(which immediately resizes the window):

    xterm -hold -e mr -m -j8 update  -q

... although in my experience this is not always fully
reproducible. Better to resize the window by hand, as the signal
might be fired before mr actually starts running (my guess).

Super strange, and breaks my sync scripts so it's really annoying. :)
But I guess it's not release critical...

#927154#10
Date:
2019-04-15 16:55:17 UTC
From:
To:
I've seen this before too but can't reproduce it easily.
Would you mind bisecting to find the commit that causes it?

#927154#15
Date:
2019-04-15 17:21:07 UTC
From:
To:
[2]anarcat@curie:myrepos((57b5fa2...)|BISECTING)$ git bisect good
15e7a50ac16554a8eafe659cc28621fcbb388e5a is the first bad commit
commit 15e7a50ac16554a8eafe659cc28621fcbb388e5a
Author: Paul Wise <pabs3@bonedaddy.net>
Date:   Tue Feb 13 20:45:20 2018 +0800

    Use $pty->getline and $pty->read as loop terminator instead of $pty->is_active

    If the command writes some data, waits a bit, writes some more data and then
    exits quickly then not all of the output will be read because a small amount
    of data will be read in the first loop while the process is still running,
    then the process will have exited and the loop will terminate,
    leaving the remaining output unread.

    Using the read data as loop terminator ensures that it will all be read.

    Found-by: diff -u <(pipetty mr -m status) <(pipetty mr -m status)

:100755 100755 b5495f3029a050ddd6930bb2b9860b1ec324bcb6 1849f0094b65061fed50be24079447b8244a03c9 M	mr

I hope that helps.

A.

#927154#20
Date:
2019-04-16 00:16:39 UTC
From:
To:
Thanks for that.

That commit fixes a pretty major bug, so I think that maybe the issue
was present all along but masked by not using the right loop terminator.

I think it would be interesting to do the bisect again but for each
commit, cherry-pick the loop terminator fix before testing.

#927154#25
Date:
2019-04-16 04:26:34 UTC
From:
To:
Ouch? Any practical way to do that?

It's kind of annoying there's no obvious way to automate the
reproducer. I tried to start it in tmux and it sometimes work, but not
reliably enough to automatically bisect...

And how would i reliable backport that patch systematically within
bisect anyways?

Thanks for any pointers...

a.

#927154#30
Date:
2019-04-16 07:03:54 UTC
From:
To:
I think a normal `git bisect` but running `git cherry-pick` before
doing tests should work, untested though.

Looks like it applies fine with `git cherry-pick` after checking out
1.20160123 so I suspect it will work fine with all the intermediary
commits anyway. To determine if the commit is included in the current
commit during the bisect, you can just `git grep is_active` or ask git
with the `git merge-base --is-ancestor 15e7a50 HEAD` exit code. If the
commit is already included, don't run `git cherry-pick`.

#927154#35
Date:
2019-04-16 17:03:28 UTC
From:
To:
Okay, so here's an interesting data point. On Debian stretch, I can't
reproduce the bug, even when running from git, using on current master
(1.20180726-30-g6cf8003).

So maybe something else is going on here, outside of myrepos itself...

I'll try to reproduce with the cherry-picking when I get back home on
that buster machine later.

A.

#927154#40
Date:
2019-10-13 07:14:55 UTC
From:
To:
Did you manage to work out how to bisect this?

Did you manage to figure out the stretch/buster difference?
I see that libio-pty-perl libio-pty-easy-perl are the same version.

I've found a workaround:

   my $pty = IO::Pty::Easy->new(handle_pty_size => 0);

After editing the IO::Pty::Easy code I found that when IO::Pty::Easy
sets $SIG{WINCH} then the error happens, even if the function it sets
as SIGWINCH handler does nothing at all.

#927154#45
Date:
2019-10-13 08:37:59 UTC
From:
To:
I managed to get a little bit further in the bisect using the attached
script but I hit a point where the commit isn't easily backportable,
I'll try to continue the backporting another day, spent too long today.

#927154#50
Date:
2019-10-13 08:52:55 UTC
From:
To:
I thought about it a bit more logically and I went back to the first
commit that added support for using IO::Pty::Easy to supervise the
output of the command and added a new commit to make it just use read
instead of read and is_active. The result was that I could reproduce
the bug even with that commit. So when running on bullseye the issue
appears to have been there from the beginning.

I guess the next step would be to bisect the dependencies between
stretch and buster, so I guess that is basically Perl only?

#927154#55
Date:
2019-10-14 13:09:22 UTC
From:
To:
Control: tag -1 patch

Hi, could you please recheck this data point? I'm seeing the buggy
behaviour on stretch too with mr from current sid (1.20180726, just
copying the sid version of /usr/bin/mr onto stretch system).

Maybe you didn't have libio-pty-easy-perl installed on the stretch system
as that would silently mask the issue?

As far as I can see, the problem here is that the
  while (my $data = $pty->read()) { ... }
loop assumes EOF when $data is false, and doesn't check for undef.
The IO::Pty::Easy docs say for read():

       Returns "undef" on timeout, the empty string on EOF, or a string
       of at least one character on success (this is consistent with
       "sysread()" and Term::ReadKey).

When SIGWINCH interrupts the blocking read(), mr proceeds to close the pty,
and the child processes die from SIGHUP.

So this doesn't look to me like a regression in Perl or other
dependencies, just a bug in the code.

Maybe something like the attached patch would do? Only lightly tested
but seems to fix it for me.

#927154#62
Date:
2019-10-14 14:42:26 UTC
From:
To:
Control: tags -1 + fixed-upstream
Control: forwarded -1 http://source.myrepos.branchable.com/?p=source.git;a=commitdiff;h=0d8dc60c227cf5868e568c7fa30ebec1f770d521

Untested, but that could definitely explain it.

I guess that since I didn't set a timeout I assumed the read call would
only return either EOF or data not undef.

Any idea why something similar doesn't also happen in the loop above,
where getline would return undef on error, terminating the loop?

That works for me, updated commit message trailer, applied upstream.

#927154#71
Date:
2019-10-14 16:03:09 UTC
From:
To:
getline() has a different interface and restarts interrupted
read(2) calls internally. It's like <> vs. sysread().

#927154#76
Date:
2019-10-15 15:50:04 UTC
From:
To:
This seems to work here, thanks!

a.