#117671 htdig: htdig does not spider hyperlinks where there is only a single character between the <a href...> and </a> tags.

Package:
htdig
Source:
htdig
Description:
web search and indexing system - binaries
Submitter:
"William R. Mussatto"
Date:
2005-07-18 03:50:02 UTC
Severity:
wishlist
#117671#5
Date:
2001-10-30 18:28:22 UTC
From:
To:
Hyperlinks with only a single character between the <a href...> and </a> tags are not followed, and the linked page is not
included when building the search information. Paginated catalog pages, for example, are not searched.
As a workaround I have to generate a special index page with more characters between the tags so that the pages get spidered.
--- Begin /etc/cron.daily/htdig (modified conffile)
#!/bin/sh
nice /etc/htdig/vnwgindex.pl
nice /usr/bin/htdig -c /etc/htdig/vnwg.conf
nice /usr/bin/htmerge -c /etc/htdig/vnwg.conf
--- End /etc/cron.daily/htdig
#117671#10
Date:
2001-10-30 19:33:52 UTC
From:
To:
Bug is probably a documentation error.  Setting the minimum_word_length
to 1 solved the problem.  This is supposed to affect what goes into the
index, but it also appears to be used to determine the minimum link
length. You can downgrade the severity of the bug since there is a
workaround.
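
For reference, the workaround amounts to a one-line change. This is a
minimal sketch, assuming the setting goes into /etc/htdig/vnwg.conf (the
file the cron job above passes with -c) and that nothing else in that file
overrides it:

```
# /etc/htdig/vnwg.conf (assumed location, taken from the cron job above)
# Allow one-character words, so that link text such as a bare page
# number is no longer skipped when following links.
minimum_word_length: 1
```

The index has to be rebuilt afterwards (htdig, then htmerge, as in the cron
job) for the change to take effect. Note that this also lets one-character
words into the index itself.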

#117671#15
Date:
2001-10-30 23:22:24 UTC
From:
To:
Hi William,

I agree with you that this is not a proper way to handle the
minimum_word_length configuration option.

I'll downgrade the bug to wishlist and have a look at it.


Thank you anyway!

Stijn.

#117671#22
Date:
2001-10-31 01:02:01 UTC
From:
To:
Ok, but what I'd really like is cookies since I'm spidering a .jsp site
and each request is a new session. <sigh> I know it's on the todo list.

Sincerely,

William Mussatto, Senior Systems Engineer
CyberStrategies, Inc
ph. 909-920-9154 ext. 27