#643021 [recoll] Forked CLI call does not return (all) hits

#643021#5
Date:
2011-09-26 16:09:35 UTC
From:
To:
--- Please enter the report below this line. ---
I am running recoll -t from a krunner plugin, i.e. forking it in the
background. This worked fine a while back. However, now (in the last few
versions), using the -f (filename search) option returns no hits at all. Users
of this plugin have also reported far too few hits on default searches.

Debugging code showed the correct command string, and then the strings XFNONE
and 0 results being returned. Simply running the same command in a console
works fine, albeit sometimes with results differing from those in the GUI
program.

Note that users had reported problems with permissions on multi-user systems
so this may be a place to look.


Debian Release: wheezy/sid
  990 unstable        www.debian-multimedia.org
  990 unstable        liquorix.net
  990 unstable        http.us.debian.org
  990 unstable        debian.tagancha.org
  990 unstable        debian.scribus.net
  990 unstable        debian.pengutronix.de
  650 testing         security.debian.org
  650 testing         http.us.debian.org
  650 testing         dl.google.com
  500 stable          security.debian.org
  500 stable          http.us.debian.org
  500 stable          deb.opera.com
  500 karmic          ppa.launchpad.net
  500 intrepid        ppa.launchpad.net
  101 experimental-snapshots qt-kde.debian.net
    1 experimental    debian.co.il
--- Package information. ---
Depends                  (Version) | Installed
==================================-+-===================
libc6                (>= 2.3.6-6~) | 2.13-21
libgcc1               (>= 1:4.1.1) | 1:4.6.1-12
libqtcore4      (>= 4:4.7.0~beta1) | 4:4.7.3-8
libqtgui4             (>= 4:4.5.3) | 4:4.7.3-8
libstdc++6                (>= 4.6) | 4.6.1-12
libx11-6                           | 2:1.4.4-2
libxapian22                        | 1.2.7-1
zlib1g                (>= 1:1.2.0) | 1:1.2.3.4.dfsg-3


Recommends      (Version) | Installed
=========================-+-===========
aspell                    | 0.60.7~20110707-1
python                    | 2.6.7-3
xdg-utils                 | 1.1.0~rc1-2
xsltproc                  | 1.1.26-8


Suggests                    (Version) | Installed
=====================================-+-===========
antiword                              | 0.37-6
catdoc                                |
flac                                  | 1.2.1-5
ghostscript                           | 9.02~dfsg-3
libid3-tools                          | 3.8.3-14
libimage-exiftool-perl                | 8.60-2
lyx                                   |
poppler-utils                         | 0.16.7-2+b1
pstotext                              |
python-chm                            |
python-mutagen                        |
unrtf                                 |
untex                                 |
vorbis-tools                          | 1.4.0-1

#643021#10
Date:
2011-09-27 09:26:25 UTC
From:
To:
David Baron writes:
 > Package: recoll
 > Version: 1.16.0-1
 > Severity: important
 >
 > --- Please enter the report below this line. ---
 > I am running recoll -t from a krunner plugin, i.e. forking it in the
 > background. This worked fine a while back. However, now (in the last few
 > versions), using the -f (filename search) option returns no hits at all.
 > Users of this plugin have also reported far too few hits on default searches.
 >
 > Debugging code showed the correct command string, and then the strings XFNONE
 > and 0 results being returned. Simply running the same command in a console
 > works fine, albeit sometimes with results differing from those in the GUI
 > program.
 >
 > Note that users had reported problems with permissions on multi-user systems
 > so this may be a place to look.

Hello,

It would be very helpful to have the log files for running the command in a
console and through krunner.

Please either set up recoll to log to a file, or arrange to retrieve stderr
output, and set the debug level to 6, either through the config GUI or by
editing ~/.recoll/recoll.conf:

logfilename=/some/file/name
loglevel = 6

- Run the search in a console.
- Save the log file (it will be erased at the next step)
- Run the same search from krunner.
- Save the log file.

Then please send both logs to me (jfd at recoll dot org).

More than permissions (which are there to be observed), one possible area
of concern might be wildcard character expansion by the shell: use proper
quoting when running in the console (i.e. recoll -t -f 'there *re *ildcards').
We'll probably have to check how krunner executes the command too, but
this kind of issue should be visible in the log file anyway.
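
To illustrate the quoting point, here is a minimal Python sketch (not part of
recoll or krunner) that uses the standard shlex module to approximate how a
POSIX shell splits the command line above into arguments. It only models word
splitting and quote removal; it does not perform wildcard expansion.

```python
import shlex

# Command line as it would be typed in a console (POSIX shell rules).
typed = "recoll -t -f 'there *re *ildcards'"

# shlex.split approximates the shell's word splitting and quote removal.
# The single quotes protect the spaces and the '*' characters, and are
# themselves removed before the program sees the argument.
argv = shlex.split(typed)

print(argv)  # ['recoll', '-t', '-f', 'there *re *ildcards']
```

The program therefore receives the search string without the quotes, which is
what the console log is expected to show.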

Regards,

jf

#643021#15
Date:
2011-09-27 09:50:27 UTC
From:
To:
console-log from console
krunner-log from krunner

#643021#20
Date:
2011-09-27 10:01:01 UTC
From:
To:
<Log files enclosed:
<console-log from console
<krunner-log from krunner

Note that the krunner one has a query *'downloads'*  !!

I do not do this, obviously.

I have asked a correspondent to do this same test with a non -f search, which
was also not succeeding but returning 3 / 150 hits.

#643021#25
Date:
2011-09-27 14:04:29 UTC
From:
To:
David Baron writes:
 > <Log files enclosed:
 > <console-log from console
 > <krunner-log from krunner
 >
 > Note that the krunner one has a query *'downloads'*  !!
 >
 > I do not do this, obviously.
 >
 > I have asked a correspondent to do this same test with a non -f search,
 > which was also not succeeding but returning 3 / 150 hits.

Ok, thanks for the logs, they make it clearer what is happening here.

From krunner:

:4:../rcldb/rclquery.cpp:174:Query::setQuery:
:4:../rcldb/rcldb.cpp:1525:Rcl::Db::filenameWildExp: pattern:[*'Downloads'*]

Command line:

:4:../rcldb/rclquery.cpp:174:Query::setQuery:
:4:../rcldb/rcldb.cpp:1525:Rcl::Db::filenameWildExp: pattern: [Downloads]

I will be using [] for quoting in the rest of the message (the [] are not
part of the strings).

First a bit of explanation on the handling of file name searches: recoll
will prepend and append a [*] to a file name search if it does not
already contain wildcards and is not capitalized. Trying to do the right
thing here, but maybe being slightly too clever.

So the krunner search is expanded from ['Downloads'] to [*'Downloads'*] because
['] is not a capital letter (nor is it treated as punctuation, because of
searches like o'donnell etc.)

The second search is not expanded because [D] is a capital. Alternatively,
searching for [download] would yield a [*download*] search.
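
The rule just described can be sketched as follows. This is only an
illustration of the behaviour as explained in this message, not recoll's actual
code: the function name is hypothetical, and both the set of wildcard
characters and the "first character is uppercase" test are assumptions.

```python
def expand_filename_pattern(pattern: str) -> str:
    """Sketch of the described file name search rule (not recoll's code)."""
    # Assumption: these characters count as wildcards.
    has_wildcard = any(ch in pattern for ch in "*?[")
    # Assumption: "capitalized" means the first character is uppercase.
    capitalized = pattern[:1].isupper()
    if has_wildcard or capitalized:
        return pattern
    # Otherwise the pattern is wrapped in wildcards on both sides.
    return "*" + pattern + "*"

for p in ["'Downloads'", "Downloads", "download", "down*"]:
    print(p, "->", expand_filename_pattern(p))
```

With a leading single quote the first character is not a capital, so the
quoted string gets wrapped and the quotes become part of the pattern.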

This is all particularly annoying because it does not show in the end
search, which only has the XNONENoMatchingTerms thing, because expansion
actually occurs (or not) before the search is passed to Xapian.

I'll easily admit that the Recoll choices are dubious here (I'm open to
suggestions), and I was going to write that I'd at least document this
disconcerting behaviour of the file name search, but in fact, it is,
already:

http://www.recoll.org/usermanual/rcl.search.html#RCL.SEARCH.SIMPLE

The actual problem here seems to be too much quoting in the data sent by
krunner. The parameter incoming to recoll is really ['Downloads'] when it
should just be [Downloads]. This might also cause the other query issues
that you mention.

What's strange is that such a krunner issue should also show up with other
commands. Or was the search actually entered with single quotes in the
krunner window? I can't really guess what happens or should be done here
because I don't know how krunner executes commands (sh -c or exec(2) or
whatever...)

Getting close...

Cheers,

jf

#643021#30
Date:
2011-09-27 14:26:33 UTC
From:
To:
The run is being done by a start( QString cmd, QStringList args ) type of
fork. I, as recommended, place the query string argument in single quotes in
the program, not in krunner's text line window. I assume the internal start()
function is an exec but I could be wrong.

Question: since the query string is a single QString, the last entry in the
QStringList, should the quotes not be there? Within this list of arguments,
there is no ambiguity. The question would be whether, after it is expanded in
the run shell, the non-quoted string would be problematic.

Easy enough to try out but not knowing recoll's internals, cannot really touch
all the bases.

The * problem does not explain a non-filename problem--I hope the
correspondent did the same tests and logfiles and sent them as I suggested to
him.

Ultimately, I should probably take snippets from recoll's sources and do it
directly to xapian rather than the fork, but this runner is meant to be simple
and small. Performance in such an interactive environment is not an issue.

#643021#35
Date:
2011-09-27 14:57:45 UTC
From:
To:
A big question, however: The GUI implies and my results seem to indicate that
the and/or/query-language options do not work with filenames. Is this true? Or
would they work WITH the quotes (seems not to)?

This is a GUI design issue since if filename is an exclusive option, then it
would radio-button with the others or gray them if checkboxed.

#643021#40
Date:
2011-09-27 16:54:20 UTC
From:
To:
David Baron writes:
 > > The run is being done by a start( QString cmd, QStringList args ) type of
 > > fork. I, as recommended, place the query string argument in single quotes
 > > in the program, not in krunner's text line window. I assume the internal
 > > start() function is an exec but I could be wrong.
 > >
 > > Question, since the query string is a single QString, last entry in the
 > > QStringList, should the quotes not be there? Within this list of arguments,
 > > there is no ambiguity. Question would be after it is expanded in the run
 > > shell, would the non-quoted string be problematic?

You'd have to check what the "start" function actually does. If it starts a
shell to execute the command, in a way which will make the wildcards expand
(and the quoting be removed), you need quoting. Given the look of the call,
I'd guess that it's closer to a simple fork/exec operation, meaning that no
wildcard expansion will take place before recoll receives the arguments,
and that you must not quote.
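
As a rough illustration of the fork/exec case, the following Python sketch
starts a child process from an argument list, with no shell involved. The
child here is just a small Python one-liner standing in for recoll, so nothing
in it reflects recoll or QProcess internals.

```python
import subprocess
import sys

# A tiny child program that prints its first argument exactly as received.
child = [sys.executable, "-c", "import sys; print(sys.argv[1])"]

# Passing a list runs the program directly (no shell), so quote
# characters inside an argument reach the child unchanged.
quoted = subprocess.run(child + ["'Downloads'"], capture_output=True, text=True)
plain = subprocess.run(child + ["Downloads"], capture_output=True, text=True)

print(quoted.stdout.strip())  # 'Downloads'
print(plain.stdout.strip())   # Downloads
```

When no shell is involved, nothing strips the quotes, so they end up as part
of the search string.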

 > > Easy enough to try out but not knowing recoll's internals, cannot really
 > > touch all the bases.

Recoll internals are not at issue here; you'd have the same problems
executing "vi" or "ls".

 > I tried it and lo and behold, I get filename search results.

Ok, then confirmation of the fork/exec kind of spawn.

 > A big question, however: The GUI implies and my results seem to indicate
 > that the and/or/query-language options do not work with filenames. Is
 > this true? Or would they work WITH the quotes (seems not to)?
 >
 > This is a GUI design issue since if filename is an exclusive option, then it
 > would radio-button with the others or gray them if checkboxed.

Hmm, sorry, I'm a bit lost here: what radio buttons? I'm not sure what
dialog we're talking about here.

Using the query language, filename queries can be combined normally with
others, e.g. [wildcard filename:*manual*] (this would return, among
others, usermanual.sgml, which has the term [wildcard] in it).

Using the simple search "File name" option, there is nothing to combine it
with, this is a pure file name search.

I think that this is more or less correctly described in the "search"
section of the manual:
http://www.lesbonscomptes.com/recoll/usermanual/rcl.search.html

There are many possible combinations though, and I'm not sure I've tested them
all. I'll be glad to try and fix problems that I did not see.

Cheers,

jf

#643021#45
Date:
2011-09-27 17:01:33 UTC
From:
To:
Sorry, I forgot to answer these in the previous email:

David Baron writes:
 > The * problem does not explain a non-filename problem--I hope the
 > correspondent did the same tests and logfiles and sent them as I
 > suggested to him.

Excessive quoting may also affect non-filename searches, and there is also
the capital issue: if the user is not careful about it, search results will
be different (because of stemming/no stemming).

 > Ultimately, I should probably take snippets from recoll's sources and do
 > it directly to xapian rather than the fork, but this runner is meant to
 > be simple and small. Performance in such an interactive environment is
 > not an issue.

Going directly to Xapian would be quite complex. Recoll does quite a lot of
processing before asking stuff from Xapian, so I would really not recommend
this (except if you want to rewrite recoll :) )

Actually I think that your approach is quite reasonable. Another
possibility would be to use either the Python API or the C++ interface
which is just below it: as it's used to implement the Python and PHP APIs
and also the recollq program, it's quite stable, and simple (take a look at
recollq). The main problem going this way would be build issues, as Recoll
is not currently structured to export a library (which is why the Python
approach would really be the most natural, except that your program is C++,
I guess).

Cheers,

JF

#643021#50
Date:
2011-09-27 18:42:37 UTC
From:
To:
Attached are 4 log files:

  * one from "recoll -t -q gazette" (155 results)
  * one from recollrunner with the same query (only "default query
    language" checked in recollrunner config) (3 results: only the ones
    among the 155 which do not contain spaces in their paths)
  * one from "recoll -t -f -q gazette" (46 results)
  * one from recollrunner with the same query ("default query language"
    checked and "match filenames" checked in recollrunner config) (0
    results)


I hope it will help solve this issue.

Regards

Denis

#643021#55
Date:
2011-09-28 06:35:41 UTC
From:
To:
Denis Prost writes:
 >    Attached are 4 log files:
 >      * one from "recoll -t -q gazette" (155 results)
 >      * one from recollrunner with the same query (only "default query
 >        language" checked in recollrunner config) (3 results: only the
 >        ones among the 155 which do not contain spaces in their paths)
 >      * one from "recoll -t -f -q gazette" (46 results)
 >      * one from recollrunner with the same query ("default query language"
 >        checked and "match filenames" checked in recollrunner config) (0
 >        results)
 >
 >    I hope it will help solve this issue.
 >    Regards
 >    Denis

Thanks a lot for the log files, my comments below:

first:
 > :4:../rcldb/rcldb.cpp:1525:Rcl::Db::filenameWildExp: pattern: [*gazette*]

My guess is that this is from the 3rd query (recoll -t -f -q gazette). The
"-q" which would specify a "query language" query is ignored (because of how
the options are parsed), and this is a filename query where gazette is
transformed to *gazette* because it is neither capitalized nor contains
wildcards. It is supposed to return all documents with [gazette] as part of
their file name.

Second:
 > :4:../rcldb/searchdata.cpp:782:StringToXapianQ:: query string: [gazette]

This is from  [recoll -t -q gazette], which is a regular text search query,
returning all documents with gazette or a derivative ([gazettes]) in the
contents, or possibly in the file name field processed as text.

Third:

 > :4:../rcldb/searchdata.cpp:782:StringToXapianQ:: query string: ['gazette']

This is probably from recollrunner with only 'default query language'
checked: there is excessive quoting, but it doesn't hurt much because this
is a full text search and the quotes get eliminated. I don't know why
recollrunner returns few results, but as you mention that these are only
the ones without spaces in the file name, I'd suspect a problem parsing the
output from recoll.

Fourth:
 > :4:../rcldb/rcldb.cpp:1525:Rcl::Db::filenameWildExp: pattern: [*'gazette'*]

This is with recollrunner, "match filenames" and "default query language"
checked. "Match filename" takes precedence and the query fails because of the
excessive quoting.

The only thing that I find strange in the logs is that the 3rd one seems to
indicate that the query actually returns more results than the 1st one,
when I would have thought that they are identical. But the quoting may have
affected the query, the actual Xapian query is truncated in the log for
some reason, so we can't be sure:

:4:../rcldb/rclquery.cpp:237:Query::SetQuery: Q: ((gazette:(wqf=11) OR gazettes OR gazet:4:../rcldb/rclquery.cpp:344:Fetching for first 50, count 50

So I think that the first fixes should be for recollrunner to:
 - Avoid excessive single quote quoting
 - Indicate somehow that "query language" and "file name search" are
   different and exclusive modes.
 - Try to better parse the query output when there are spaces in the file
   names.

And then we may get into possible Recoll issues. I'd be quite interested
though by the logs from the 2 following commands:

recoll -t -q gazette
recoll -t -q "'gazette'"

Cheers,

Jf

#643021#60
Date:
2011-09-28 05:06:08 UTC
From:
To:
Here are the two logs :

  * recoll -t -q gazette.log (same as already sent)
  * recoll -t -q "gazette".log

Regards,

Denis

#643021#65
Date:
2011-09-28 08:39:43 UTC
From:
To:
I am no longer quoting filename searches.

I have changed the stdout line parsing to
.....[ --> mimetype after trimming
[......] --> URL/path
[----]  --> name, title, etc ...

Spaces are not used for anything (except removed from the mimetype). I can see
filenames with spaces.
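
A minimal Python sketch of the bracket-based parsing described above might look
like this. The line layout (mimetype, then [URL], then [title]) is taken from
the description in this message; the function name and the sample line are
hypothetical, not actual recoll output.

```python
def parse_result_line(line: str):
    """Split one result line into (mimetype, url, title).

    Assumed layout (hypothetical, based on the description above):
        mimetype [url] [title] trailing-text
    """
    first_open = line.find("[")
    if first_open < 0:
        return None
    # Everything before the first '[' is the mimetype, once trimmed.
    mimetype = line[:first_open].strip()
    first_close = line.find("]", first_open)
    if first_close < 0:
        return None
    url = line[first_open + 1:first_close]
    # The second bracketed field, if present, is the name/title.
    second_open = line.find("[", first_close)
    second_close = line.find("]", second_open) if second_open >= 0 else -1
    title = line[second_open + 1:second_close] if second_close >= 0 else ""
    return mimetype, url, title

# Hypothetical sample line with a space in the path.
sample = "text/plain\t[file:///home/me/my notes.txt] [my notes.txt]\t1234\tbytes"
print(parse_result_line(sample))
```

Spaces inside the brackets are preserved, but a ']' inside a path or title
would end the field early under this scheme.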

krunner seems not to be including every match I feed to it. In other words, I
know I am getting three filename results into the program but only one of them
(first one?) actually gets displayed. This may be why Denis still only sees
three of his gazettes (unless this is still the space problem). In any event,
I may post a new version on kde-apps next week.

#643021#70
Date:
2011-09-28 09:20:33 UTC
From:
To:
David Baron writes:
 > On Wednesday 29 Elul 5771 09:35:41 Jean-Francois Dockes wrote:
 > > This is probably from recollrunner with only 'default query language'
 > > checked: there is excessive quoting, but it doesn't hurt much because this
 > > is a full text search and the quotes get eliminated. I don't know why
 > > recollrunner returns few results, but as you mention that these are only
 > > the ones without spaces in the file name, I'd suspect a problem parsing the
 > > output from recoll.
 >
 > I am no longer quoting filename searches.
 >
 > I have changed the stdout line parsing to
 > .....[ --> mimetype after trimming
 > [......] --> URL/path
 > [----]  --> name, title, etc ...
 >
 > Spaces are not used for anything (except removed from the mimetype). I can see
 > filenames with spaces.
 >
 > krunner seems not to be including every match I feed to it. In other words, I
 > know I am getting three filename results into the program but only one of them
 > (first one?) actually gets displayed. This may be why Denis still only sees
 > three of his gazettes (unless this is still the space problem). In any event,
 > I may post a new version on kde-apps next week.

Ok, I don't know enough about krunner to be of real use here.

We should be aware that the recollq/recoll -t output is not fully parseable
at this point (a file name with ']' in it would break it). If you can get
the krunner part to behave, and if you decide that the current approach is
the sensible one (as compared to using an API), I could easily be convinced
to provide a fully and easily parsable output format (for example by encoding
the data parts in base64), we can talk about this.
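
To make the base64 idea concrete, here is a small Python sketch of what such a
format could look like. It is purely hypothetical: the field order, the
space separator and the function names are assumptions for illustration, not
an existing recoll output format.

```python
import base64

def encode_record(fields):
    """Encode each field as base64 and join the results with single spaces."""
    return " ".join(
        base64.b64encode(f.encode("utf-8")).decode("ascii") for f in fields
    )

def decode_record(line):
    """Inverse of encode_record: split on spaces and decode each token."""
    return [base64.b64decode(tok).decode("utf-8") for tok in line.split(" ")]

# Hypothetical record whose file name contains spaces and brackets.
record = ["text/plain", "file:///home/me/odd ]name[.txt", "odd ]name[.txt"]
line = encode_record(record)
print(line)
print(decode_record(line))
```

Because the base64 alphabet contains neither spaces nor brackets, the consumer
can split on spaces without caring what characters the file names contain.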

Cheers,

jf

#643021#75
Date:
2011-10-01 17:41:19 UTC
From:
To:
I do think that a fully, consistently parsable output is desirable. This would
enable various scripting options, not just for my little krunner. Also, using
%20 instead of spaces and appropriate encodings for other illegal characters
would make the [URL] canonical/legal. "File://...." implies a URL. Was it done
this way previously (is that why the space problem did not show up before)?
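
For reference, this is roughly what percent-encoding a path would look like,
sketched in Python with the standard urllib.parse module. The path is a
made-up example, and this says nothing about what recoll itself outputs.

```python
from urllib.parse import quote, unquote

# Hypothetical path containing a space.
path = "/home/me/my notes.txt"

# quote() percent-encodes characters that are not legal in a URL path;
# '/' is left alone by default.
url = "file://" + quote(path)
print(url)  # file:///home/me/my%20notes.txt

# Decoding recovers the original path.
print(unquote(url[len("file://"):]))  # /home/me/my notes.txt
```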

On filename searches, simple text used as *name* makes sense. Capitals are
sometimes automatic, many times ignored, and are language specific.
Name queried as Name* might make sense. "name" would imply no wildcards. This
stuff is easier than regex, but regex might be desirable as a query alternative
for text and filenames.

Using an API? Is there one in the works?

#643021#80
Date:
2021-12-20 17:12:26 UTC
From:
To:
Hello, good evening. Please call me now or reply to the email I sent you yesterday.