#593504 mawk: Does not produce incremental output, even with fflush() calls

Package:
mawk
Source:
mawk
Description:
Pattern scanning and text processing language
Submitter:
Josh Triplett
Date:
2017-10-15 05:09:03 UTC
Severity:
wishlist
#593504#3
Date:
2010-08-18 19:20:02 UTC
From:
To:
$ for i in `seq 10` ; do echo x ; sleep 1 ; done | gawk '{ print $1 ; fflush() }' | ts
Aug 18 12:16:39 x
Aug 18 12:16:40 x
Aug 18 12:16:41 x
Aug 18 12:16:42 x
Aug 18 12:16:43 x
Aug 18 12:16:44 x
Aug 18 12:16:45 x
Aug 18 12:16:46 x
Aug 18 12:16:47 x
Aug 18 12:16:48 x
$ for i in `seq 10` ; do echo x ; sleep 1 ; done | mawk '{ print $1 ; fflush() }' | ts
Aug 18 12:17:14 x
Aug 18 12:17:14 x
Aug 18 12:17:14 x
Aug 18 12:17:14 x
Aug 18 12:17:14 x
Aug 18 12:17:14 x
Aug 18 12:17:14 x
Aug 18 12:17:14 x
Aug 18 12:17:14 x
Aug 18 12:17:14 x


(Piping to ts makes stdout buffered, hence the fflush(); the same
behavior occurs if I don't pipe to ts and omit the fflush(), but then I
don't have the timestamps to make the problem obvious.)

- Josh Triplett

#593504#8
Date:
2010-08-18 20:59:31 UTC
From:
To:
Is "ts" a package program, or one of your own scripts?

(this is probably fixed upstream, but I'd like to verify that)

#593504#13
Date:
2010-08-18 21:44:51 UTC
From:
To:
It's in the Debian package 'moreutils'.

Thanks,
Jamey

#593504#18
Date:
2010-08-18 22:14:35 UTC
From:
To:
thanks - I see that it's not fixed in upstream (there was a fix to
buffering which affects a different case).

#593504#23
Date:
2010-08-18 23:03:24 UTC
From:
To:
tags 593504 + upstream
quit

Thomas Dickey wrote:

Tagging accordingly.

#593504#30
Date:
2012-11-29 01:40:07 UTC
From:
To:
I happened to revisit this one, and realized it's not a bug per se.

gawk does a fflush after each write to stdout when stdout happens to
be a tty.

man mawk provides this option:

       -W interactive sets unbuffered writes to stdout and line buffered reads
                      from stdin.  Records from stdin are lines regardless  of
                      the value of RS.

Given that mawk's trying to be fast, doing what gawk is doing would
detract from that goal.  The option's there for people who want it.

#593504#37
Date:
2012-11-29 16:09:49 UTC
From:
To:
produce incremental output.

- Josh Triplett

#593504#42
Date:
2012-11-30 00:06:56 UTC
From:
To:
mawk's doing a flush as requested - buffered I/O works a little
differently than just that - go back and read my comment rather than
reacting to it.  I did read the source code for gawk, of course.

I've attached traces using ltrace to demonstrate that the effect is
because gawk does multiple fflush's because it does multiple writes.

#593504#47
Date:
2012-11-30 00:40:40 UTC
From:
To:
By the way, the traces also show that mawk is making fewer library calls
than gawk.  Current mawk uses more calls than the 1.3.3 version,
mostly in the command-line parsing so that mawk knows the difference between

	-W interactive
	-W ignore

(though you'll see a couple more sprinkled in the traces, due to more
important bug-fixes).  One might argue for making it read command-line
options from an environment variable as I did for dialog - that in turn
would cost a little more speed - but since there's no progress in fixing
Debian's mawk package, that's an academic point.

#593504#52
Date:
2017-10-15 05:05:49 UTC
From:
To:
Resurrecting old bug thread because this hasn't yet been resolved.

This report was about an unexpected behavior of fflush(). The below was
expected to print() once per second, but instead it looks like it
printed all at once after all the sleeps:

$ for i in `seq 10` ; do echo x ; sleep 1 ; done | mawk '{ print $1 ; fflush() }' | ts
Aug 18 12:17:14 x
Aug 18 12:17:14 x
Aug 18 12:17:14 x
Aug 18 12:17:14 x
Aug 18 12:17:14 x
Aug 18 12:17:14 x
Aug 18 12:17:14 x
Aug 18 12:17:14 x
Aug 18 12:17:14 x
Aug 18 12:17:14 x

I ran a few tests, and it looks like fflush() isn't doing anything
wrong, but rather the problem is that mawk buffers the INPUT in addition
to the output.

I did this:

$ date;                                                                  \
  for i in `seq 5` ; do                                                  \
      echo x;                                                            \
      sleep 1;                                                           \
  done | ~/projects/ltrace/ltrace -t -x '*@libc.so.6' -o /tmp/ltrace.out \
         mawk '{ print $1; fflush() }' | ts;                             \
  egrep 'flush' /tmp/ltrace.out | grep -v resumed

  Sat Oct 14 21:48:42 PDT 2017
  Oct 14 21:48:47 x
  Oct 14 21:48:47 x
  Oct 14 21:48:47 x
  Oct 14 21:48:47 x
  Oct 14 21:48:47 x
  21:48:47 fflush( <unfinished ...>
  21:48:47 fflush@libc.so.6( <unfinished ...>
  21:48:47 fflush( <unfinished ...>
  21:48:47 fflush@libc.so.6( <unfinished ...>
  21:48:47 fflush( <unfinished ...>
  21:48:47 fflush@libc.so.6( <unfinished ...>
  21:48:47 fflush( <unfinished ...>
  21:48:47 fflush@libc.so.6( <unfinished ...>
  21:48:47 fflush( <unfinished ...>
  21:48:47 fflush@libc.so.6( <unfinished ...>


I.e. we started at 21:48:42, and I'd expect prints at :42, :43, :44, ...
We don't get those. Ltrace DOES show the fflush() libc calls, but those
all happen at the end, not as the data comes in. This is wrong, and
implies that the problem is on the data READING end, not the WRITING
end. And I can prove it even more conclusively:

  $ date; \
    for i in `seq 5` ; do echo x ; sleep 1 ; done | mawk '{ system("date") }'

  Sat Oct 14 21:53:05 PDT 2017
  Sat Oct 14 21:53:10 PDT 2017
  Sat Oct 14 21:53:10 PDT 2017
  Sat Oct 14 21:53:10 PDT 2017
  Sat Oct 14 21:53:10 PDT 2017
  Sat Oct 14 21:53:10 PDT 2017

I would expect mawk to read the data a line at a time and to invoke the
'date' command each time. Instead we get a bunch of dates at the end,
meaning that mawk is buffering the input.

Finally, looking at the syscalls I get this:

  $ for i in `seq 5` ; do echo x ; sleep 1 ; done | strace -f -e read,execve,clone -t mawk '{ system("date") }'

  21:56:48 execve("/usr/bin/mawk", ["mawk", "{ system(\"date\") }"], [/* 40 vars */]) = 0
  21:56:48 read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\200V\0\0\0\0\0\0"..., 832) = 832
  21:56:48 read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\0\4\2\0\0\0\0\0"..., 832) = 832
  21:56:48 read(0, "x\n", 4096)           = 2
  21:56:48 read(0, "x\n", 4094)           = 2
  21:56:49 read(0, "x\n", 4092)           = 2
  21:56:50 read(0, "x\n", 4090)           = 2
  21:56:51 read(0, "x\n", 4088)           = 2
  21:56:52 read(0, "", 4086)              = 0
  21:56:53 clone(strace: Process 16988 attached
  child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f5b16f029d0) = 16988
  [pid 16988] 21:56:53 execve("/bin/sh", ["/bin/sh", "-c", "date"], [/* 40 vars */]) = 0
  [pid 16988] 21:56:53 read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\0\4\2\0\0\0\0\0"..., 832) = 832
  [pid 16988] 21:56:53 clone(strace: Process 16989 attached
  child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f292bae69d0) = 16989
  [pid 16989] 21:56:53 execve("/bin/date", ["date"], [/* 40 vars */]) = 0
  ....

I.e. mawk is reading the data when it comes in (the read() syscalls are
not delayed), but it must be chunking the data in userspace in some way
before doing any processing. I'm guessing this is a performance
optimization. When using gawk this doesn't happen, and the only
buffering to worry about is the output buffering, which can be disabled
with fflush(). If this extra layer of buffering isn't a bug (and I'm
assuming it's not), then there really isn't anything to do, but we
REALLY should add a note in the manpage, maybe next to where fflush() is
documented. Because this layer of buffering breaks the thing that
users like me try to achive with fflush().

Thanks!