Between grep 2.18-2 and grep 2.20-3 (fixing bug 758105), there is a huge slowdown when binary files (with invalid UTF-8 sequences) are involved. The timings on my personal svn working copy (with all my files), when searching for a word that doesn't exist (no matches):

  grep 2.18-2: 0.9 s
  grep 2.20-3: 11.6 s

Note: the -P is useless in this case, but it is useful in other cases.
Vincent Lefevre wrote:

It's not clear from http://bugs.debian.org/761157 that the performance problem occurs only with -P, but I assume that's what is meant. Since this is a performance bug with PCRE, I suggest moving the Debian bug report to the Debian libpcre3 package. Grep cannot go back to the old way, which could cause grep to crash, and the bug cannot be fixed in grep because libpcre3 does not provide a fast way to search arbitrary data that may include encoding errors. It really is a problem that requires changes to libpcre3 to fix; grep cannot fix it.

In the meantime, in order to use 'grep' to search for strings in arbitrary data, I suggest omitting the '-P'. Also, I suggest using the C locale.

As the GNU bug 18266 "grep -P and invalid exits with error" has been fixed, I'm closing that bug report. Please feel free to open a separate GNU bug report for the performance issue.

PS. While composing this email I noticed another bug in grep -P and encoding errors, which I fixed by installing the attached patch.
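The workaround suggested above (drop '-P' and run grep in the C locale) could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names c_locale_env, plain_grep_argv and run_plain_grep are hypothetical, and a grep binary is assumed to be on PATH.

```python
import os
import subprocess

def c_locale_env(base=None):
    """Return a copy of the environment with LC_ALL forced to "C"."""
    env = dict(os.environ if base is None else base)
    env["LC_ALL"] = "C"  # LC_ALL overrides LANG and the other LC_* variables
    return env

def plain_grep_argv(pattern, paths):
    """Build a grep command line without -P (hypothetical helper)."""
    # "-e" keeps a pattern starting with "-" from being read as an option;
    # "--" marks the end of options before the file names.
    return ["grep", "-e", pattern, "--", *paths]

def run_plain_grep(pattern, paths):
    """Run grep in the C locale; only the child process sees LC_ALL=C."""
    return subprocess.run(plain_grep_argv(pattern, paths),
                          env=c_locale_env(), capture_output=True)
```

Since the environment is changed only for the grep child process, anything else started by the caller keeps the user's own locale. In the C locale grep works on bytes, so a pattern containing a multibyte character is no longer matched as a single character.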
Thanks for fixing yet another bug, Paul. Would you mind adding a test to trigger that one?
Ordinarily I would have done that already but this -P stuff is so buggy and slow that I got discouraged. (If we keep having trouble with -P I may start lobbying to remove it....) Anyway, I gave it a shot with the attached further patch.
It's specific to -P:

  2.18-2: 0.9s with -P, 0.4s without -P
  2.20-3: 11.6s with -P, 0.4s without -P

Fixing the performance problem in libpcre3 would indeed be better (even with the old version of grep, libpcre3 was twice as slow as grep, but this is less critical than a 13x slowdown). However, a workaround in grep could be simpler. I've just opened a new bug and suggested several solutions:

  http://debbugs.gnu.org/cgi/bugreport.cgi?bug=18454

Omitting -P is a bit annoying because I sometimes use specific PCRE features. I could try to parse the arguments, detect where the pattern is used, and avoid -P if the pattern doesn't use specific PCRE features (at least for the most common forms). An additional advantage is that it could be twice as fast in most cases (see above). This could also be done in grep, as I suggested in my new bug report.

Using the C locale could be a solution, because in practice I pipe the result to "less -FRX", but only grep has to use the C locale, so that the accented characters are still correctly displayed by "less". However, with some (rare?) patterns, it won't work because an accented character would no longer be seen as a single character.
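A minimal sketch of the "avoid -P when the pattern doesn't need it" idea, assuming a deliberately conservative heuristic: any backslash, any "(?" group, and any lazy or possessive quantifier is taken as a sign that PCRE is wanted. The names needs_pcre and choose_flag are hypothetical, and the list of constructs is an assumption of this sketch, not something grep or PCRE defines.

```python
import re

# Constructs that this sketch treats as PCRE-only (an assumption, kept
# conservative: false positives merely keep -P, they never drop it wrongly).
_PCRE_HINTS = re.compile(
    r"""
    \\            # any backslash escape, e.g. \d, \K, \b, \Q...\E
  | \(\?          # (?:...), lookaround, inline flags, named groups
  | [*+?}][?+]    # lazy (*?, +?, ??, }?) or possessive (*+, ++, ?+, }+) forms
    """,
    re.VERBOSE,
)

def needs_pcre(pattern):
    """Return True if the pattern looks like it relies on PCRE syntax."""
    return _PCRE_HINTS.search(pattern) is not None

def choose_flag(pattern):
    """Pick -P only when the heuristic thinks it is required, else -E."""
    return "-P" if needs_pcre(pattern) else "-E"
```

For a plain word such as zzzyyyxxx this picks -E, while \d+ or foo(?=bar) keep -P. The heuristic only aims at selecting the same lines; ERE and PCRE can still differ in which part of a line they report as the match (for example with -o), and a few edge cases such as a stray "{" are not handled.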
Thank you. Looks perfect. I too rely on grep's -P, sometimes using PCRE features that are very hard to emulate using EREs.
Control: found -1 3.7-1

The current grep version 3.7-1 is also affected. The upstream bug has been closed after the move to PCRE2. Once the new grep is available in Debian, some tests need to be done to see whether this change introduces any regression.
Control: found -1 3.11-4

For instance, on some large binary file:

  7.15s with -P
  0.12s without -P

That's even much worse than before: the ratio was 13, it is now 59!
Hmm... The upstream bug was closed and archived more than 3 years ago! I'm Cc'ing Paul, who closed the bug upstream. BTW, the slowness (with a regexp consisting of just letters) also seems to occur with grep 3.11 under Termux/Android.
I assume the upstream bug is <https://bugs.gnu.org/18454>. Can you supply more details, such as PCRE2 version, locale, hardware, and a test case for this new problem?

I just checked, and grep-3.11-9.fc41.x86_64 and pcre2-tools-10.44-1.fc41.1.x86_64 run at the same speed on Fedora 41 x86-64 (AMD Phenom II X4 910e) in the en_US.utf8 locale when given the test case data noted in the upstream bug report:

  $ time grep -P zzzyyyxxx 10840.pdf

  real    0m0.071s
  user    0m0.063s
  sys     0m0.007s
  $ time pcre2grep -U zzzyyyxxx 10840.pdf

  real    0m0.072s
  user    0m0.062s
  sys     0m0.009s

You can get that test data from <http://research.nhm.org/pdfs/10840/10840.pdf>.
Yes (this is written at the top of the web page of the Debian bug).

After some tests, this is actually a different bug since text files are affected too. To reproduce:

  $ seq 600000000 > file
  $ ls -l --human-readable file
  -rw-r--r-- 1 vinc17 vinc17 5.5G 2025-02-26 22:10:29 file
  $ time grep x file
  Command exited with non-zero status 1
  0.10user 0.48system 0:00.59elapsed 99%CPU (0avgtext+0avgdata 2048maxresident)k
  0inputs+0outputs (0major+161minor)pagefaults 0swaps
  $ time grep -P x file
  Command exited with non-zero status 1
  15.92user 0.47system 0:16.39elapsed 99%CPU (0avgtext+0avgdata 2560maxresident)k
  0inputs+0outputs (0major+176minor)pagefaults 0swaps

pcre2grep also takes a lot of time, whether -U is given or not. So this new issue may be a bug in the PCRE2 library.
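For anyone wanting to repeat the comparison on another machine, here is a rough timing sketch. It assumes grep is on PATH; the helper names time_command, slowdown_ratio and compare_grep are hypothetical. With the elapsed times above (16.39 s with -P, 0.59 s without), the ratio comes out at about 28.

```python
import subprocess
import time

def time_command(argv):
    """Run argv once, discard its output, return elapsed wall-clock seconds."""
    start = time.perf_counter()
    # grep exits with status 1 when nothing matches, so no check=True here.
    subprocess.run(argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return time.perf_counter() - start

def slowdown_ratio(seconds_with_p, seconds_without_p):
    """How many times slower the -P run was than the plain run."""
    return seconds_with_p / seconds_without_p

def compare_grep(pattern, path):
    """Time grep with and without -P on one file (assumes grep on PATH)."""
    without_p = time_command(["grep", "-e", pattern, "--", path])
    with_p = time_command(["grep", "-P", "-e", pattern, "--", path])
    return with_p, without_p, slowdown_ratio(with_p, without_p)
```

Calling compare_grep("x", "file") would mirror the two commands in the transcript. A single run is noisy, and the first run may also pay for reading the file into the page cache, so repeating the measurement a few times is advisable.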