I just got this bug report in the Debian BTS.

---------- Forwarded message ----------
Date: Mon, 31 Jan 2000 17:13:54 +0100
From: Florian Hars <florian@hars.de>
To: submit@bugs.debian.org
Subject: Bug#56721: htdig and locale de_DE peculiarities.
Resent-Date: Mon, 31 Jan 2000 16:18:02 +0000 (GMT)
Resent-From: Florian Hars <florian@hars.de>
Resent-To: debian-bugs-dist@lists.debian.org
Resent-cc: Gergely Madarasz <gorgo@sztaki.hu>

Package: htdig
Version: 3.1.4-1

This is probably for upstream. I use htdig with a locale: de_DE setting. It seems unable to find occurrences of words containing non-ascii characters that are part of titles, <Hn> or emphasis elements.

Say, if I look for "bég" in my data, it finds an index.html document that contains the line

    <a href="beg-islamabad-1990.html">Bég 1991: From the Quark Model to the Stand...</a>

but not the document beg-islamabad-1990.html itself, that starts with:

    <html><head><title>Bég 1991: From the Quark Model to the Stand...</title>
    <body>
    <h1>Mirza Abdul Baqi <strong>Bég</strong>: From the Quark Model to the Standard Model: Ten Fateful Years in Particle Physics (1964--74 C.\,E.).</h1>
    <p>Mirza Abdul Baqi <strong>Bég</strong> (1991): <em>From the Quark Model to the Standard Model: Ten Fateful Years in Particle Physics (1964--74 C.\,E.).</em>

It also doesn't find another document containing

    <p><a href="beg-islamabad-1990.html">Mirza Abdul Baqi <strong>Bég</strong>: <em>From the Quark Model to the Stand...</em> 221-284</a></p>

although it finds both documents if I look for "Mirza".

Yours, Florian.

--
+ when hideous hordes of web designers will leave ripped bloodless bodies of hosts they parasited upon and convulsively start tearing limbs of each other in agony illuminated by artificial light [...], then we know that time has come for dêë|||zêïñe++++ >>>> Å.Ñ.Ñ.Ï.H.Î.L.Ä.T.Î.Ö.Ñ -- www.absurd.org
At 5:21 PM +0100 1/31/00, Gergely Madarasz wrote: This is rather odd. You see, the HTML parser doesn't pay much attention to emphasis tags like <strong> or <em> and doesn't really do anything different about <Hn> tags as far as recording words. However, Marc Pohl <Marc.Pohl@wdr.de> found a problem with handling of 8-bit characters. I don't know whether it would fix this problem, but the patch is attached. Please let me know if this helps.
Geoff Hutchison <ghutchis@wso.williams.edu> writes: It doesn't change anything that I am aware of. Yours, Florian.
According to Gergely Madarasz: This is very strange. I can't see anything in the code that could explain the behaviour described below. Does debian include any patches to 3.1.4, or just a straight, unmodified installation of the 3.1.4 tarball? If there are any patches, please provide us with them.

My first impulse was to say, oh, this is a problem with title_factor and heading_factor_1 through heading_factor_6 being set to 0, but that would not explain why unaccented words in headings and titles are found (unless those words appear elsewhere in the document). It also wouldn't explain why the <strong> tag has any effect - htdig normally ignores that tag. Given that the e-acute works sometimes, I think we can rule out a problem with the locale - that would either work consistently or fail consistently.

Perhaps you should set start_url to the URLs of the two documents above that were giving you problems, and run htdig -vvvvv -i -s to see what it is doing when parsing these names. You may also want to change your database_dir so as to avoid clobbering your current database.
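The suggested test run can be sketched roughly as follows. This is only an illustration, not part of htdig: the helper names, the config file name bug-test.conf and the example.org URLs are made up, and a real test config would also need whatever other attributes the production config sets. The script writes a scratch configuration that overrides start_url and database_dir (and keeps the locale setting from the bug report), then prints the htdig command line to run against it.

```python
import os
import shlex
import tempfile


def write_scratch_config(urls, locale_name, scratch_dir):
    """Write a minimal htdig config into scratch_dir and return its path."""
    # "bug-test.conf" is a hypothetical file name chosen for this sketch.
    conf_path = os.path.join(scratch_dir, "bug-test.conf")
    lines = [
        # Separate directory, so the existing database is not clobbered.
        "database_dir: " + scratch_dir,
        # Index only the documents that misbehave.
        "start_url: " + " ".join(urls),
        # Same locale as in the bug report.
        "locale: " + locale_name,
    ]
    with open(conf_path, "w") as fh:
        fh.write("\n".join(lines) + "\n")
    return conf_path


def htdig_command(conf_path):
    """Return the argument list for a verbose initial dig with statistics."""
    return ["htdig", "-vvvvv", "-i", "-s", "-c", conf_path]


if __name__ == "__main__":
    scratch = tempfile.mkdtemp(prefix="htdig-test-")
    # Placeholder URLs; substitute the two documents that are not found.
    conf = write_scratch_config(
        ["http://www.example.org/beg-islamabad-1990.html",
         "http://www.example.org/other-document.html"],
        "de_DE",
        scratch,
    )
    with open(conf) as fh:
        print(fh.read())
    print(" ".join(shlex.quote(arg) for arg in htdig_command(conf)))
```

The script deliberately only prints the command rather than running it, so it can be tried on a machine without htdig installed.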
There are a few changes in the build process (don't build libdb, use glibc's libdb or libdb2, depending on the glibc version number), and a couple of changes for backporting to the older db api like:

    - if ((seqrc = dbp->cursor(dbp, NULL, &dbcp, 0)) != 0)
    + if ((seqrc = dbp->cursor(dbp, NULL, &dbcp)) != 0)

No patches which would modify parsing. You can check the diff on ftp.debian.org/debian/dists/potato/main/source/web/htdig_3.1.4-1.diff.gz
According to Gergely Madarasz: Hmmm. I don't like the sound of "backporting to the older db". We have not determined that the problem is in HTML parsing, and in fact that seems quite unlikely at this point. We've included Sleepycat's Berkeley DB 2.6.4 (with a few patches), since about version 3.1.1, so I'd be leery about going back to an earlier version. I'm not sure how dependent the code is on the current version, but I think there were reasons for going with it. Version 3.2 of htdig will include a more recent version of the DB package, with a whole lot more patches to it, and it will definitely be dependent on the bundled version.

I'd certainly recommend building 3.1.4 with its bundled db package before we go looking elsewhere for problems. The user reported problems that I simply can't explain based on the code, and are nothing like anything else we've seen, so it's clear that something is going very wrong deep in the bowels of the code, and I'd say that the DB package is as good a place to start as any.

I took a look at the patch file above, and I must say that the man pages are a nice touch. We should probably fold them into the distribution.

I did notice a problem with the debian/postinst script, though. The test for the endings and synonyms databases is wrong - right now it's testing for the document databases. Also, the message to the installer suggests that /usr/sbin/htdigconfig will rebuild an existing endings database. It won't. It will only rebuild the endings or the synonyms database if it doesn't exist. The code that was commented out of rundig probably does a better test, as it will rebuild if the source files are newer than the current synonyms or endings databases.

I'd also highly recommend conv_doc.pl over parse_doc.pl for 3.1.4.

I was also a bit curious about the huge htlib/DB2_db.I in the patch file. Is this an artifact, or was it necessary to get htdig to build with Debian's DB package?
Was there any reason to upgrade to the latest libdb other than that there was a newer version? If there are additional patches needed, were they sent upstream? If there are problems with libdb, then why does glibc still include an older version? The Debian policy is to use shared libraries as much as possible, and I haven't seen any problems so far from using an older libdb...

The man pages were created by the previous maintainer of the .deb package. Geoff already asked about them and the answer was yes back then...

The postinst test also remained from the previous maintainer... it is wrong but works most of the time :) Nobody has complained about it yet anyway, so if it works, I won't break it :)

Yes, but currently those files are in /usr/lib/htdig, which may be mounted read-only, so they can't be rebuilt on the fly... They could be moved, but the upgrade could be difficult (automatic changes of conffiles are not allowed, etc....)

parse_doc.pl was modified a bit by a fellow Debian developer to handle all possible converters from .doc, .ps and .pdf available in Debian... I didn't actually go into it, so I just included the file...

Hmm... htlib/DB2_db.I probably remained from the time I debugged why it wouldn't build with the installed libdb :) I'll remove it from the next upload.
I do not know why glibc includes an older version, though it may be a copyright assignment issue. Sleepycat's release notes included some bug fixes that sounded quite significant and implied they recommended the upgrade. (No, I don't remember what the issues were, it's been a while.)

Yeah, I still have them--I'll add them to the CVS copy after the release so I have time to update them.

I'm not as sure that DB is the cause here, but as I said, I cannot see anything in the code that would give these symptoms. So my suggestion roughly echoes Gilles--compile the original tar from htdig.org and see if that helps. Remember that if the databases are altered, you have to clean out the old ones before reindexing (or use -a without any .work files present).
I think Geoff answered the first few of your questions.

According to Gergely Madarasz: It works most of the time because if it finds a document database, chances are the endings and synonyms databases have been built. It'll break if you install an htdig update package that uses an incompatible endings or synonyms database format. So, it's not likely to be an issue just yet, but it could become one. What I was getting at is it may make sense to use similar tests in your postinst and htdigconfig scripts.

The configuration section of conv_doc.pl is almost identical to parse_doc.pl's, so the debian-specific stuff should migrate easily. The only gotcha is that the new parse_doc.pl and conv_doc.pl use the -raw option by default for the pdftotext command that comes with xpdf 0.90, so you'd have to take that out to work with other PDF to text filters which the debian-specific code looks for. The text parser in parse_doc.pl is really pretty crude, and doesn't parse in a manner consistent with htdig's internal parsers.
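For illustration, the kind of test discussed here (rebuild a derived database when it is missing, or when its source files are newer than it) might look something like the sketch below. It is written in Python rather than shell purely for the example; the function name needs_rebuild and the file names in the usage note are hypothetical, and it is not the actual rundig, postinst or htdigconfig code.

```python
import os


def needs_rebuild(db_path, source_paths):
    """Return True if db_path is missing or older than any existing source."""
    if not os.path.exists(db_path):
        # No database yet: it has to be built.
        return True
    db_mtime = os.path.getmtime(db_path)
    # Rebuild when at least one source file was modified after the database.
    return any(
        os.path.getmtime(src) > db_mtime
        for src in source_paths
        if os.path.exists(src)
    )
```

A caller would run the check once per derived database, for example needs_rebuild("/path/to/synonyms.db", ["/path/to/synonyms"]) with placeholder paths, and only invoke the rebuild step when it returns True. A modification-time comparison will not notice a change of database format between htdig versions, so a package upgrade that changes the format would still need to force the rebuild.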
At 2:58 PM -0600 2/4/00, Gilles Detillieux wrote: Actually, the endings and synonyms database formats will change with 3.2.0b1. So it's not "if it breaks" but "when it breaks."
Gilles Detillieux <grdetil@scrc.umanitoba.ca> writes: I did this, restricting htdig to the two documents it did find with "mirza" but not with "bég", and there is nothing that looks funny to me. If I run htsearch with the database containing only these two files, it even finds them if I look for "bég". I then removed my old database and reindexed everything, and it no longer found them with "bég". This might lend some credibility to the thesis that it is not the parser, but the db. Yours, Florian.