#761157 grep -P is very slow on binary files in UTF-8 locales

Package:
grep
Source:
grep
Description:
GNU grep, egrep and fgrep
Submitter:
Vincent Lefevre
Date:
2025-02-26 21:24:01 UTC
Severity:
important
Tags:
#761157#5
Date:
2014-09-11 08:12:35 UTC
From:
To:
Between grep 2.18-2 and grep 2.20-3 (fixing bug 758105), there is a
huge slowdown when binary files (with invalid UTF-8 sequences) are
involved. The timings on my personal svn working copy (with all my
files), when searching for a word that doesn't exist (no matches):

grep 2.18-2:  0.9 s
grep 2.20-3: 11.6 s

Note: the -P is useless in this case, but it is useful in other cases.

#761157#10
Date:
2014-09-11 17:07:49 UTC
From:
To:
Vincent Lefevre wrote:

It's not clear from http://bugs.debian.org/761157 that the performance
problem occurs only with -P, but I assume that's what is meant.

Since this is a performance bug with PCRE, I suggest moving the Debian
bug report to the Debian libpcre3 package.  Grep cannot go back to the
old way, which could cause grep to crash, and the bug cannot be fixed in
grep because libpcre3 does not provide a fast way to search arbitrary
data that may include encoding errors.  It really is a problem that
requires changes to libpcre3 to fix; grep cannot fix it.

In the meantime, in order to use 'grep' to search for strings in
arbitrary data, I suggest omitting the '-P'.  Also, I suggest using the
C locale.
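
A minimal sketch of this workaround, assuming a POSIX shell and GNU grep. The helper name bytegrep is hypothetical and is not part of grep; it only runs grep in the C locale, and the caller leaves out -P.

```shell
# bytegrep is a hypothetical helper name, not a grep feature.
# LC_ALL=C makes grep treat its input as single-byte data, so bytes that
# are invalid in UTF-8 need no special handling. -P is deliberately absent.
bytegrep() {
    LC_ALL=C grep "$@"
}

# Example: count matching lines in data that contains invalid UTF-8 bytes.
sample=$(mktemp)
printf 'abc\377\376\nneedle\n' > "$sample"
bytegrep -c needle "$sample"
rm -f "$sample"
```

The trade-off is that PCRE-only syntax is unavailable and that multibyte characters are matched as separate bytes.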

As the GNU bug 18266 "grep -P and invalid exits with error" has been
fixed, I'm closing that bug report.  Please feel free to open a separate
GNU bug report for the performance issue.

PS.  While composing this email I noticed another bug in grep -P and
encoding errors, which I fixed by installing the attached patch.

#761157#15
Date:
2014-09-11 18:37:11 UTC
From:
To:
Thanks for fixing yet another bug, Paul.
Would you mind adding a test to trigger that one?

#761157#20
Date:
2014-09-11 19:10:27 UTC
From:
To:
Ordinarily I would have done that already but this -P stuff is so buggy
and slow that I got discouraged.  (If we keep having trouble with -P I
may start lobbying to remove it....) Anyway, I gave it a shot with the
attached further patch.

#761157#25
Date:
2014-09-12 01:42:47 UTC
From:
To:
It's specific to -P:

2.18-2   0.9s with -P, 0.4s without -P
2.20-3  11.6s with -P, 0.4s without -P

Fixing the performance problem in libpcre3 would indeed be better
(even with the old version of grep, libpcre3 was twice as slow as
grep, but this is less critical than a 13x slowdown).

However, a workaround in grep could be simpler. I've just opened a
new bug and suggested several solutions:

http://debbugs.gnu.org/cgi/bugreport.cgi?bug=18454

Omitting -P is a bit annoying because I sometimes use specific PCRE features.
I could try to parse the arguments, detect where the pattern is used,
and avoid -P if the pattern doesn't use specific PCRE features (at
least for the most common forms). An additional advantage is that it
could be twice as fast in most cases (see above). This could also be
done in grep, as I suggested in my new bug report.
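
The idea above could be sketched as a small wrapper, under loud assumptions: the names choose_matcher and smartgrep are hypothetical, the pattern must be the first argument (no real option parsing is attempted), and the detection is deliberately conservative. Only patterns made purely of letters, digits and underscores are treated as literals and searched with -F; everything else is still handed to -P.

```shell
# choose_matcher PATTERN: print the grep matcher option to use.
# Hypothetical, conservative heuristic: a pattern containing nothing but
# letters, digits and underscores is a plain literal, so -F is equivalent.
choose_matcher() {
    case $1 in
        *[!A-Za-z0-9_]*) printf '%s\n' -P ;;   # may use regex syntax: keep PCRE
        *)               printf '%s\n' -F ;;   # plain literal: avoid PCRE
    esac
}

# smartgrep PATTERN [FILE...]: run grep with the chosen matcher.
# Assumes the pattern comes first; other grep options are not supported here.
smartgrep() {
    pattern=$1
    shift
    grep "$(choose_matcher "$pattern")" -e "$pattern" -- "$@"
}
```

For example, `smartgrep foo file` would run `grep -F`, while `smartgrep 'foo(?=bar)' file` would still run `grep -P`. A less conservative version could also map simple regular expressions to -E, but the two syntaxes differ in enough details that this is left out of the sketch.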

Using the C locale could be a solution: in practice, I pipe the result
to "less -FRX", and only grep has to use the C locale, so the accented
characters are still displayed correctly by "less". However, with
some (rare?) patterns, it won't work because an accented character
would no longer be seen as a single character.
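
The caveat about accented characters can be illustrated with a short sketch, assuming GNU grep and a POSIX shell. The input is "é" encoded in UTF-8 (the two bytes 0xC3 0xA9); in the C locale, "." matches one byte, so a pattern expecting a single character per line no longer matches.

```shell
# Only grep gets the C locale; a pager later in the pipeline keeps the
# user's locale, e.g.:  LC_ALL=C grep -r pattern . | less -FRX

# "é" in UTF-8 is two bytes. Count lines matching one "." and two "."s.
one=$(printf '\303\251\n' | LC_ALL=C grep -c '^.$') || true
two=$(printf '\303\251\n' | LC_ALL=C grep -c '^..$') || true
printf 'C locale: one-char pattern matched %s line(s), two-char pattern matched %s line(s)\n' "$one" "$two"
```

In a UTF-8 locale the one-character pattern would be the one expected to match, which is why this workaround only suits patterns that do not depend on character boundaries.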

#761157#36
Date:
2014-09-12 16:13:22 UTC
From:
To:
Thank you. Looks perfect.

I too rely on grep's -P, sometimes using PCRE features
that are very hard to emulate using EREs.

#761157#41
Date:
2021-11-24 16:24:02 UTC
From:
To:
Control: found -1 3.7-1

The current grep version 3.7-1 is also affected.

The upstream bug has been closed after the move to PCRE2.

Once the new grep is available in Debian, some tests need to be
done to see whether this change introduces any regression.

#761157#50
Date:
2025-02-26 15:53:26 UTC
From:
To:
Control: found -1 3.11-4

For instance, on some large binary file:
  7.15s with -P
  0.12s without -P

That's even worse than before: the ratio was 13, and it is now 59!

#761157#57
Date:
2025-02-26 16:06:07 UTC
From:
To:
Hmm... The upstream bug was closed and archived more than 3 years ago!
I'm Cc'ing Paul, who closed the bug upstream.

BTW, the slowness (with a regexp consisting of just letters) also
seems to occur with grep 3.11 under Termux/Android.

#761157#62
Date:
2025-02-26 20:28:29 UTC
From:
To:
I assume the upstream bug is <https://bugs.gnu.org/18454>.

Can you supply more details, such as PCRE2 version, locale, hardware,
and a test case for this new problem? I just checked, and
grep-3.11-9.fc41.x86_64 and pcre2-tools-10.44-1.fc41.1.x86_64 run at the
same speed on Fedora 41 x86-64 (AMD Phenom II X4 910e) in the en_US.utf8
locale when given the test case data noted in the upstream bug report:

   $ time grep -P zzzyyyxxx 10840.pdf

   real	0m0.071s
   user	0m0.063s
   sys	0m0.007s
   $ time pcre2grep -U zzzyyyxxx 10840.pdf

   real	0m0.072s
   user	0m0.062s
   sys	0m0.009s

You can get that test data from
<http://research.nhm.org/pdfs/10840/10840.pdf>.

#761157#67
Date:
2025-02-26 21:20:13 UTC
From:
To:
Yes (this is written at the top of the web page of the Debian bug).

After some tests, this is actually a different bug since text files
are affected too. To reproduce:

$ seq 600000000 > file
$ ls -l --human-readable file
-rw-r--r-- 1 vinc17 vinc17 5.5G 2025-02-26 22:10:29 file
$ time grep x file
Command exited with non-zero status 1
0.10user 0.48system 0:00.59elapsed 99%CPU (0avgtext+0avgdata 2048maxresident)k
0inputs+0outputs (0major+161minor)pagefaults 0swaps
$ time grep -P x file
Command exited with non-zero status 1
15.92user 0.47system 0:16.39elapsed 99%CPU (0avgtext+0avgdata 2560maxresident)k
0inputs+0outputs (0major+176minor)pagefaults 0swaps

pcre2grep also takes a lot of time, whether -U is given or not.
So this new issue may be a bug in the PCRE2 library.
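
For readers who want to repeat the comparison without a 5.5G file, here is a scaled-down sketch. The helper names run_timed and compare_grep are hypothetical, the timing uses whole seconds from date, and it assumes seq, mktemp and a grep built with -P support are available.

```shell
# run_timed CMD...: run CMD with output discarded, then report elapsed
# wall-clock time (whole seconds only) and the exit status.
run_timed() {
    start=$(date +%s)
    status=0
    "$@" > /dev/null 2>&1 || status=$?
    end=$(date +%s)
    printf '%ss exit=%s  %s\n' "$((end - start))" "$status" "$*"
}

# compare_grep LINES: search the output of "seq LINES" for "x", which never
# occurs, so both runs scan the whole file; exit=1 means "no match".
compare_grep() {
    file=$(mktemp)
    seq "$1" > "$file"
    run_timed grep x "$file"
    run_timed grep -P x "$file"
    rm -f "$file"
}
```

Calling `compare_grep 50000000` gives a file of a few hundred megabytes; the line count may need to be raised before the difference becomes visible at one-second resolution. An exit status of 2 on the second line would suggest that the local grep lacks -P support.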