#56721 htdig and locale de_DE peculiarities.

Package:
htdig
Source:
htdig
Description:
web search and indexing system - binaries
Submitter:
Florian Hars
Date:
2005-07-18 03:19:24 UTC
Severity:
normal
#56721#5
Date:
2000-01-31 16:13:54 UTC
From:
To:
Package: htdig
Version: 3.1.4-1

This is probably for upstream.

I use htdig with a locale: de_DE setting. It seems unable to find
occurrences of words containing non-ASCII characters that are part of
titles, <Hn> or emphasis elements. Say, if I look for "bég" in my
data, it finds an index.html document that contains the line

<a href="beg-islamabad-1990.html">Bég 1991: From the Quark
Model to the Stand...</a>

but not the document beg-islamabad-1990.html itself, that starts with:

<html><head><title>Bég 1991: From the Quark Model to the
Stand...</title>
<body>
<h1>Mirza Abdul Baqi <strong>Bég</strong>: From the Quark Model
to the Standard Model: Ten Fateful Years in Particle Physics (1964--74
C.\,E.).</h1>
<p>Mirza Abdul Baqi <strong>Bég</strong> (1991): <em>From the
Quark Model to the Standard Model: Ten Fateful Years in Particle
Physics (1964--74 C.\,E.).</em>

It also doesn't find another document containing

<p><a href="beg-islamabad-1990.html">Mirza Abdul Baqi
<strong>Bég</strong>: <em>From the Quark Model to the
Stand...</em> 221-284</a></p>

although it finds both documents if I look for "Mirza".

Yours, Florian.

#56721#8
Date:
2000-01-31 16:21:43 UTC
From:
To:
I just got this bug report in the Debian BTS:
---------- Forwarded message ----------
Date: Mon, 31 Jan 2000 17:13:54 +0100
From: Florian Hars <florian@hars.de>
To: submit@bugs.debian.org
Subject: Bug#56721: htdig and locale de_DE peculiarities.
Resent-Date: Mon, 31 Jan 2000 16:18:02 +0000 (GMT)
Resent-From: Florian Hars <florian@hars.de>
Resent-To: debian-bugs-dist@lists.debian.org
Resent-cc: Gergely Madarasz <gorgo@sztaki.hu>

Package: htdig
Version: 3.1.4-1

This is probably for upstream.

I use htdig with a locale: de_DE setting. It seems unable to find
occurrences of words containing non-ascii characters that are part of
titles, <Hn> or emphasis elements. Say, if i look for "bég" in my
data, it finds an index.html document that contains the line

<a href="beg-islamabad-1990.html">Bég 1991: From the Quark
Model to the Stand...</a>

but not the document beg-islamabad-1990.html itself, that starts with:

<html><head><title>Bég 1991: From the Quark Model to the
Stand...</title>
<body>
<h1>Mirza Abdul Baqi <strong>Bég</strong>: From the Quark Model
to the Standard Model: Ten Fateful Years in Particle Physics (1964--74
C.\,E.).</h1>
<p>Mirza Abdul Baqi <strong>Bég</strong> (1991): <em>From the
Quark Model to the Standard Model: Ten Fateful Years in Particle
Physics (1964--74 C.\,E.).</em>

It also doesn't find another document containing

<p><a href="beg-islamabad-1990.html">Mirza Abdul Baqi
<strong>Bég</strong>: <em>From the Quark Model to the
Stand...</em> 221-284</a></p>

although it finds both documents if I look for "Mirza".

Yours, Florian.

--
+ when hideous hordes of web designers will leave ripped bloodless
bodies of hosts they parasited upon and convulsively start tearing
limbs of each other in agony illuminated by artificial light [...],
then we know that time has come for dêë|||zêïñe++++
>>>> Å.Ñ.Ñ.Ï.H.Î.L.Ä.T.Î.Ö.Ñ -- www.absurd.org

#56721#9
Date:
2000-02-04 02:21:05 UTC
From:
To:
At 5:21 PM +0100 1/31/00, Gergely Madarasz wrote:

This is rather odd. You see, the HTML parser doesn't pay much
attention to emphasis tags like <strong> or <em> and doesn't really
do anything different about <Hn> tags as far as recording words.

However, Marc Pohl <Marc.Pohl@wdr.de> found a problem with handling
of 8-bit characters. I don't know whether it would fix this problem,
but the patch is attached.

Please let me know if this helps,

#56721#10
Date:
2000-02-04 12:57:22 UTC
From:
To:
Geoff Hutchison <ghutchis@wso.williams.edu> writes:

It doesn't change anything that I am aware of.

Yours, Florian.

#56721#11
Date:
2000-02-04 17:57:39 UTC
From:
To:
According to Gergely Madarasz:

This is very strange.  I can't see anything in the code that could
explain the behaviour described below.  Does debian include any patches to
3.1.4, or just a straight, unmodified installation of the 3.1.4 tarball?
If there are any patches, please provide us with them.

My first impulse was to say, oh, this is a problem with title_factor and
heading_factor_1 through heading_factor_6 being set to 0, but that would
not explain why unaccented words in headings and titles are found (unless
those words appear elsewhere in the document).  It also wouldn't explain
why the <strong> tag has any effect - htdig normally ignores that tag.

Given that the e-acute works sometimes, I think we can rule out a problem
with the locale - that would either work consistently or fail consistently.
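
To illustrate that reasoning, here is a minimal Python sketch. It is not
htdig code: the two regular expressions are hypothetical stand-ins for an
ASCII-only and a Latin-1-aware notion of "word character", and the tag
stripping is deliberately naive. It shows that a character-classification
problem would split "Bég" the same way in the link text and in the
heading, so it could not account for one being found and the other not.

```python
import re

# Hypothetical stand-ins for two different ideas of a "word character".
ASCII_WORD = re.compile(r"[A-Za-z0-9]+")
LATIN1_WORD = re.compile(r"[A-Za-z0-9\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]+")

# Naive tag stripper; good enough for the two snippets below.
TAG = re.compile(r"<[^>]*>")


def words(html, word_re):
    """Strip tags, then return the lowercased words matched by word_re."""
    text = TAG.sub(" ", html)
    return [w.lower() for w in word_re.findall(text)]


# Snippets taken from the original report.
link = '<a href="beg-islamabad-1990.html">Bég 1991: From the Quark</a>'
heading = "<h1>Mirza Abdul Baqi <strong>Bég</strong>: From the Quark Model</h1>"

for name, doc in (("link", link), ("heading", heading)):
    found_ascii = "bég" in words(doc, ASCII_WORD)
    found_latin1 = "bég" in words(doc, LATIN1_WORD)
    print(name, "ascii:", found_ascii, "latin1:", found_latin1)
```

Under either classification the two snippets behave identically, which is
why the surrounding markup should not matter if the locale were at fault.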

Perhaps you should set start_url to the URLs of the two documents above
that were giving you problems, and run htdig -vvvvv -i -s to see what it
is doing when parsing these names.  You may also want to change your
database_dir so as to avoid clobbering your current database.
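
One way to set that up without touching the live configuration is
sketched below in Python. The helper names, the sample base
configuration, the URLs and the scratch directory are all made up for
illustration; only start_url, database_dir and the htdig options -v, -i,
-s and -c come from htdig itself. The sketch assumes one attribute per
line and does not handle backslash-continued values.

```python
# Attributes that the scratch configuration overrides.
OVERRIDDEN = ("start_url:", "database_dir:")


def scratch_config(base_text, urls, db_dir):
    """Copy a config, replacing start_url and database_dir.

    Assumes each attribute sits on a single line (no continuations).
    """
    kept = [line for line in base_text.splitlines()
            if not line.lstrip().startswith(OVERRIDDEN)]
    kept.append("start_url: " + " ".join(urls))
    kept.append("database_dir: " + db_dir)
    return "\n".join(kept) + "\n"


def htdig_command(config_path):
    """Build the verbose initial-dig command line suggested above."""
    return ["htdig", "-vvvvv", "-i", "-s", "-c", config_path]


# Hypothetical values; substitute the real config text and document URLs.
base = "database_dir: /var/lib/htdig\nstart_url: http://localhost/\nlocale: de_DE\n"
urls = ["http://localhost/beg-islamabad-1990.html",
        "http://localhost/other-document.html"]

print(scratch_config(base, urls, "/tmp/htdig-test"))
print(" ".join(htdig_command("/tmp/htdig-test.conf")))
```

The script only prints the configuration and the command; writing the
file and running htdig are left to the user.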

#56721#12
Date:
2000-02-04 18:05:08 UTC
From:
To:
There are a few changes in the build process (don't build libdb, use
glibc's libdb or libdb2, depending on the glibc version number), and a
couple of changes for backporting to the older db api like:
-        if ((seqrc = dbp->cursor(dbp, NULL, &dbcp, 0)) != 0)
+        if ((seqrc = dbp->cursor(dbp, NULL, &dbcp)) != 0)
No patches which would modify parsing.
You can check the diff on
ftp.debian.org/debian/dists/potato/main/source/web/htdig_3.1.4-1.diff.gz

#56721#13
Date:
2000-02-04 19:16:02 UTC
From:
To:
According to Gergely Madarasz:

Hmmm.  I don't like the sound of "backporting to the older db".  We have
not determined that the problem is in HTML parsing, and in fact that
seems quite unlikely at this point.  We've included Sleepycat's Berkeley
DB 2.6.4 (with a few patches), since about version 3.1.1, so I'd be leery
about going back to an earlier version.  I'm not sure how dependent the
code is on the current version, but I think there were reasons for going
with it.  Version 3.2 of htdig will include a more recent version of the
DB package, with a whole lot more patches to it, and it will definitely be
dependent on the bundled version.  I'd certainly recommend building 3.1.4
with its bundled db package before we go looking elsewhere for problems.

The user reported problems that I simply can't explain based on the
code, and that are nothing like anything else we've seen, so it's clear that
something is going very wrong deep in the bowels of the code, and I'd
say that the DB package is as good a place to start as any.

I took a look at the patch file above, and I must say that the man pages
are a nice touch.  We should probably fold them into the distribution.

I did notice a problem with the debian/postinst script, though.  The test
for the endings and synonyms databases is wrong - right now it's testing
for the document databases.  Also, the message to the installer suggests
that /usr/sbin/htdigconfig will rebuild an existing endings database.
It won't.  It will only rebuild the endings or the synonyms database if
it doesn't exist.  The code that was commented out of rundig probably
does a better test, as it will rebuild if the source files are newer than
the current synonyms or endings databases.  I'd also highly recommend
conv_doc.pl over parse_doc.pl for 3.1.4.
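
The "rebuild if the source files are newer" test mentioned above can be
sketched as follows. This is a Python illustration of the idea only, not
the code from rundig, and the function name and file paths are
hypothetical.

```python
import os


def needs_rebuild(database, sources):
    """Return True if the database is missing or older than any source file."""
    if not os.path.exists(database):
        return True
    db_mtime = os.path.getmtime(database)
    return any(os.path.getmtime(src) > db_mtime for src in sources)


if __name__ == "__main__":
    # Hypothetical paths; the real locations depend on the installation.
    db = "/tmp/example-synonyms.db"
    src = [p for p in ["/tmp/example-synonyms"] if os.path.exists(p)]
    print("rebuild needed:", needs_rebuild(db, src))
```

A check like this covers both the missing-database case and the case of
a database left over from older source files.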

I was also a bit curious about the huge htlib/DB2_db.I in the patch file.
Is this an artifact, or was it necessary to get htdig to build with
debian's DB package?

#56721#14
Date:
2000-02-04 19:35:06 UTC
From:
To:

Was there any reason to upgrade to the latest libdb other than that there
was a newer version? If there are additional patches needed, were they sent
upstream? If there are problems with libdb, then why does glibc still
include an older version? The debian policy is to use shared libraries as
much as possible, and I haven't seen any problems so far for using an
older libdb...

They were created by the previous maintainer of the .deb package, Geoff
already asked about them and the answer was yes back then...

Also remained from the previous maintainer... it is wrong but works most
of the time :) nobody complained yet about it anyway, so if it works,
I won't break it :)

Yes, but currently those files are in /usr/lib/htdig which may be mounted
readonly, so can't be rebuilt on the fly... they could be moved, but the
upgrade could be difficult (automatic changes of conffiles are not
allowed, etc....)

parse_doc.pl was a bit modified by a fellow debian developer to handle all
possible converters from .doc, .ps and .pdf available in debian... I
didn't actually go into it, so I just included the file...

Hmm... probably remained from the time I debugged why it wouldn't build
with the installed libdb :) I'll remove it from the next upload.

#56721#15
Date:
2000-02-04 20:11:07 UTC
From:
To:
I do not know why glibc includes an older version, though it may be a
copyright assignment issue. Sleepycat's release notes included some bug
fixes that sounded quite significant and implied they recommended the
upgrade. (No, I don't remember what the issues were, it's been a while.)

Yeah, I still have them--I'll add them to the CVS copy after the release
so I have time to update them.

I don't know that I'm as sure that DB is the cause here, but as I said, I
cannot see anything in the code that would give these symptoms. So my
suggestion roughly echoes Gilles's: compile the original tar from htdig.org
and see if that helps. Remember that if the databases are altered, you
have to clean out the old ones before reindexing (or use -a without any
.work files present).

#56721#16
Date:
2000-02-04 20:58:00 UTC
From:
To:
I think Geoff answered the first few of your questions.

According to Gergely Madarasz:

It works most of the time because if it finds a document database, chances
are the endings and synonyms databases have been built.  It'll break if
you install an htdig update package that uses an incompatible endings or
synonyms database format.  So, it's not likely to be an issue just yet,
but it could become one.

What I was getting at is it may make sense to use similar tests in your
postinst and htdigconfig scripts.

The configuration section of conv_doc.pl is almost identical to
parse_doc.pl's, so the debian-specific stuff should migrate easily.
The only gotcha is that the new parse_doc.pl and conv_doc.pl use the -raw
option by default for the pdftotext command that comes with xpdf 0.90,
so you'd have to take that out to work with other PDF to text filters
which the debian-specific code looks for.

The text parser in parse_doc.pl is really pretty crude, and doesn't parse
in a manner consistent with htdig's internal parsers.

#56721#17
Date:
2000-02-04 23:15:27 UTC
From:
To:
At 2:58 PM -0600 2/4/00, Gilles Detillieux wrote:

Actually, the endings and synonyms database formats will change with
3.2.0b1. So it's not "if it breaks" but "when it breaks."

#56721#18
Date:
2000-02-07 10:29:03 UTC
From:
To:
Gilles Detillieux <grdetil@scrc.umanitoba.ca> writes:

I did this restricting htdig to the two documents it did find with
"mirza", but not with "bég", and there is nothing that looks funny to
me.  If I run htsearch with the database containing only these two
files, it even finds them if I look for "bég".

I then removed my old database and reindexed everything, and it
no longer found them with "bég".

This might lend some credibility to the thesis that it is not the
parser, but the db.

Yours, Florian.