#144689 htdig: default htdig configuration uses massive disk space

Package:
htdig
Source:
htdig
Description:
web search and indexing system - binaries
Submitter:
Edward Doolittle
Date:
2005-07-18 04:02:29 UTC
Severity:
wishlist
#144689#5
Date:
2002-04-27 03:37:51 UTC
Hello,

I installed htdig on my system, and on its first run it used 1.5 GB of
disk space before I stopped it. It was indexing every Debian document
on my system, which has quite a few packages installed. I believe
the problem is that the default /var/www/index.html links to
http://localhost/doc/apache/, and htdig strips off apache/ and
indexes /doc/. That really isn't what I expected when I installed
the htdig package.

I suggest either that htdig be configured to skip /doc/ (see my
htdig.conf below; it would probably be better to skip //localhost/doc/
instead of /doc/, which might skip too many files) or that the default
/var/www/index.html not reference .../doc/apache/.
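
A minimal sketch of the narrower exclusion, untested, and assuming that
exclude_urls rejects any URL containing one of its space-separated
patterns as a substring:

```
# Sketch only, not tested: narrower than a bare /doc/ pattern.
# Assumes exclude_urls drops any URL that contains one of these substrings,
# so only http://localhost/doc/... is skipped, not every path with /doc/ in it.
exclude_urls:		/cgi-bin/ .cgi //localhost/doc/
```

A bare /doc/ pattern would presumably also exclude unrelated pages such
as http://localhost/project/doc/, which is why the host-qualified form
seems safer.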

Ed Doolittle <ed.doolittle@utoronto.ca>
--- Begin /etc/htdig/htdig.conf (modified conffile)
database_dir:		/var/lib/htdig
start_url:		http://localhost/
limit_urls_to:		${start_url}
exclude_urls:		/cgi-bin/ .cgi /doc/
bad_extensions:		.wav .gz .z .bz2 .sit .au .zip .tar .hqx .exe .com \
   .gif .jpg .jpeg .aiff .class .map .ram .tgz .bin .rpm .mpg .mov .avi .css
maintainer:		ed.doolittle@utoronto.ca
max_head_length:	10000
max_doc_size:		200000
no_excerpt_show_top:	true
search_algorithm:	exact:1 synonyms:0.5 endings:0.1
next_page_text:		<img src="/htdig/buttonr.gif" border="0" align="middle" width="30" height="30" alt="next">
no_next_page_text:
prev_page_text:		<img src="/htdig/buttonl.gif" border="0" align="middle" width="30" height="30" alt="prev">
no_prev_page_text:
page_number_text:	'<img src="/htdig/button1.gif" border="0" align="middle" width="30" height="30" alt="1">' \
			'<img src="/htdig/button2.gif" border="0" align="middle" width="30" height="30" alt="2">' \
			'<img src="/htdig/button3.gif" border="0" align="middle" width="30" height="30" alt="3">' \
			'<img src="/htdig/button4.gif" border="0" align="middle" width="30" height="30" alt="4">' \
			'<img src="/htdig/button5.gif" border="0" align="middle" width="30" height="30" alt="5">' \
			'<img src="/htdig/button6.gif" border="0" align="middle" width="30" height="30" alt="6">' \
			'<img src="/htdig/button7.gif" border="0" align="middle" width="30" height="30" alt="7">' \
			'<img src="/htdig/button8.gif" border="0" align="middle" width="30" height="30" alt="8">' \
			'<img src="/htdig/button9.gif" border="0" align="middle" width="30" height="30" alt="9">' \
			'<img src="/htdig/button10.gif" border="0" align="middle" width="30" height="30" alt="10">'
no_page_number_text:	'<img src="/htdig/button1.gif" border="2" align="middle" width="30" height="30" alt="1">' \
			'<img src="/htdig/button2.gif" border="2" align="middle" width="30" height="30" alt="2">' \
			'<img src="/htdig/button3.gif" border="2" align="middle" width="30" height="30" alt="3">' \
			'<img src="/htdig/button4.gif" border="2" align="middle" width="30" height="30" alt="4">' \
			'<img src="/htdig/button5.gif" border="2" align="middle" width="30" height="30" alt="5">' \
			'<img src="/htdig/button6.gif" border="2" align="middle" width="30" height="30" alt="6">' \
			'<img src="/htdig/button7.gif" border="2" align="middle" width="30" height="30" alt="7">' \
			'<img src="/htdig/button8.gif" border="2" align="middle" width="30" height="30" alt="8">' \
			'<img src="/htdig/button9.gif" border="2" align="middle" width="30" height="30" alt="9">' \
			'<img src="/htdig/button10.gif" border="2" align="middle" width="30" height="30" alt="10">'
synonym_db:		/usr/lib/htdig/synonyms.db
endings_root2word_db:	/usr/lib/htdig/root2word.db
endings_word2root_db:	/usr/lib/htdig/word2root.db
image_url_prefix:	/htdig
--- End /etc/htdig/htdig.conf