#458939 allow search engines to index http://bugs.debian.org

#458939#5
Date:
2008-01-03 19:40:12 UTC
From:
To:
Package: www.debian.org
Severity: wishlist

Please allow search engines to index http://bugs.debian.org.  This can
be done by deleting the file http://bugs.debian.org/robots.txt.

Cheers,

#458939#10
Date:
2008-01-03 20:18:24 UTC
From:
To:
OTOH, it is probably best to wait until http://bugs.debian.org/459843
("add email masking to debbugs web interface") is done before you do
anything.

Cheers,

#458939#15
Date:
2008-01-03 20:18:36 UTC
From:
To:
reassign 458939 bugs.debian.org
thanks

Hello, the right pseudopackage is bugs.debian.org
See http://debian.org/Bugs/pseudo-packages

By the way, I think you should give some reasons for doing what you
propose :)

Best regards,

#458939#22
Date:
2008-01-03 21:07:15 UTC
From:
To:
Just for the record, we disallow indexing because the robots.txt
specification isn't complete enough to specify a maximum scan rate for
specific portions of the site. Without that, we can't allow bots to
access the site without degrading performance for its other users.

There are already mirrors which allow indexing, and you can use the
BTS's own search engine, which is far superior to google (or any other
search engine which doesn't have access to internal metadata) in this
regard.


Don Armstrong

#458939#27
Date:
2008-01-03 21:52:39 UTC
From:
To:
FWIW I saw this recently:
http://en.wikipedia.org/wiki/Wikipedia:Database_download

# Inktomi's "Slurp" can read a minimum delay between hits; if your
# bot supports such a thing using the 'Crawl-delay' or another
# instruction, please let us know.
[...]
## *at least* 1 second please. preferably more :D
## we're disabling this experimentally 11-09-2006
#Crawl-delay: 1

But, it may not be widely supported/respected, and you seem to want
the crawl rate to vary across the site.

Justin

#458939#32
Date:
2008-01-03 22:13:08 UTC
From:
To:
2008/1/3, Don Armstrong <don@debian.org> wrote:

http://en.wikipedia.org/wiki/Robots.txt#Crawl-delay_directive will
help.  Yahoo and MSNBot both support it.  I bet other major bots
support it too.  So we can allow Yahoo and MSNBot (plus Googlebot, if
they support it too) and block everyone else.
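
A minimal sketch of what such a policy might look like, checked with
Python's standard urllib.robotparser. The bot names, the 5-second delay
and the tested URL are illustrative assumptions, not the BTS's actual
configuration, and real crawlers may interpret Crawl-delay differently:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: bot names and the 5-second delay are
# assumptions for illustration only.
ROBOTS_TXT = """\
User-agent: Slurp
Crawl-delay: 5
Disallow:

User-agent: msnbot
Crawl-delay: 5
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Report, per bot, whether a bug page may be fetched and the delay it
# is asked to observe (None when no Crawl-delay applies).
for bot in ("Slurp", "msnbot", "SomeOtherBot"):
    allowed = parser.can_fetch(bot, "/cgi-bin/bugreport.cgi?bug=458939")
    print(bot, allowed, parser.crawl_delay(bot))
```

Note that the delay is declared per user-agent group, so each listed
bot gets its own budget rather than sharing one site-wide limit.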

Using my browser's default search engine is more convenient.  :)
Also, most users assume that web search engines index everything.
They may waste time searching the web before realizing that bugs.d.o
is unindexed.

Cheers,

#458939#37
Date:
2008-01-03 22:34:43 UTC
From:
To:
Google doesn't, unfortunately. I've already had a bit of a discussion
about this particular issue with them. [The other half of the problem
is that the Crawl-delay directive is per-bot, and not indexer-wide.]


Don Armstrong

#458939#42
Date:
2008-01-04 07:49:08 UTC
From:
To:
Most of the content is generated dynamically nowadays and this file has
been put in place because web crawlers have been known to severely hit the
machine hosting the BTS...

I'll let BTS admins close the bug or tag it wontfix.

Cheers,

#458939#47
Date:
2008-01-04 16:47:53 UTC
From:
To:
2008/1/3, Don Armstrong <don@debian.org> wrote:

That is unfortunate.  What did they tell you?  (Could you forward us
their last message?)

Though they do provide their Google Webmaster Tools which allow you to
adjust the crawl rate to "slower".[1]

Does that make it impractical for us to allow robots?

Cheers,
-j

^[1] http://www.google.com/webmasters/tour/tour4.html

#458939#52
Date:
2008-01-09 07:58:34 UTC
From:
To:
Uh, you're kidding right? The BTS's own search engine won't turn up hits
outside the BTS, as a trivial example...

AFAIK it was put in place when we first went dynamic, when bugs.d.o was
on master and horribly overloaded (so much so that updating the static
pages was taking over half a day).

It hasn't been removed ultimately because the CGIs provide too many
similar urls that shouldn't all be indexed; it's definitely a bug that
we don't provide some URLs that can be indexed.

Hacking around that in robots.txt seems tricky, as you can only reliably
specify Disallow: prefixes in robots.txt. Google supports "*" matches and
"$" to match against end of string and Allow: fields, and at least "*"
seems somewhat common, so something like this could work:

	Disallow: /*/       # exclude everything but the shortcuts
	Allow: /cgi-bin/bugreport.cgi?bug=
	Allow: /cgi-bin/pkgreport.cgi?pkg=*;dist=unstable$

That doesn't prevent bug=1234;reverse=yes and such, but I can't see a good
way of doing that.

I've set that up on rietz for Googlebot, we'll see if it works ok. I
don't think it's possible to make "Disallow: /*/" be the default for
all User-Agents since "*" is an extension, but extending it to MSN and
Yahoo should be fine.
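
To see which URLs those three rules would let through, here is a rough
Python sketch of wildcard matching. It assumes the precedence Google
documents today (the longest matching pattern wins, and Allow wins a
tie); the helper names are hypothetical, and other crawlers may resolve
conflicts differently:

```python
import re

# The three rules proposed above, for the Googlebot group only.
RULES = [
    ("disallow", "/*/"),
    ("allow", "/cgi-bin/bugreport.cgi?bug="),
    ("allow", "/cgi-bin/pkgreport.cgi?pkg=*;dist=unstable$"),
]

def pattern_to_regex(pattern):
    """Translate a robots.txt path using '*' and a trailing '$' into a regex."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    # '*' matches any run of characters; everything else is literal.
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile(regex + ("$" if anchored else ""))

def is_allowed(path, rules=RULES):
    """Assumed precedence: longest matching pattern wins, Allow wins ties."""
    best = None  # (pattern length, is_allow)
    for kind, pattern in rules:
        if pattern_to_regex(pattern).match(path):
            candidate = (len(pattern), kind == "allow")
            if best is None or candidate > best:
                best = candidate
    # A path that matches no rule is allowed by default.
    return True if best is None else best[1]

for path in (
    "/458939",
    "/cgi-bin/bugreport.cgi?bug=1234",
    "/cgi-bin/bugreport.cgi?bug=1234;reverse=yes",
    "/cgi-bin/pkgreport.cgi?pkg=dpkg;dist=unstable",
    "/cgi-bin/pkgreport.cgi?pkg=dpkg;dist=stable",
):
    print(path, is_allowed(path))
```

Under these assumptions the bug=1234;reverse=yes variant is still
allowed, since a plain prefix Allow cannot exclude what follows it.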

Getting smarturl.cgi properly done is still probably the real solution.

Cheers,
aj

#458939#57
Date:
2008-01-09 16:33:51 UTC
From:
To:
Okay, so I've made smarturl.cgi work again; it was broken by:

   - Debbugs::CGI not accepting params from ARGV (smarturl.cgi changed
     to set QUERY_STRING)

   - Debbugs::CGI, pkgreport.cgi and version.cgi assuming the CGIs are in
     the current HTTP path (added "/cgi-bin/")

I've made those changes on rietz directly; what's the procedure
for committing them? "sudo -u debbugs -H bzr commit" ? There was a
pre-existing change in pkgreport.cgi (adding a "^" to the "Go away"
regexp) that also wasn't committed fwiw.

I think the best solution is to deal with URL naming in the long term
as follows:

   bugs.debian.org/123456          (bug report)
   bugs.debian.org/123456/mbox     (full bug mbox format)
   bugs.debian.org/123456/10       (individual message)
   bugs.debian.org/123456/10/mbox  (individual message mbox format)
   bugs.debian.org/123456/10/att/3 (attachment to a message)

   bugs.debian.org/source/dpkg     (bugs against dpkg in unstable)
   bugs.debian.org/package/dpkg

   bugs.debian.org/source/dpkg/1.14.14   (bugs against dpkg 1.14.14)
   bugs.debian.org/package/dpkg/1.14.14

   bugs.debian.org/usertag/debian-release@lists.debian.org
   bugs.debian.org/usertag/debian-release@lists.debian.org/rc-arm

   bugs.debian.org/maint/debian-dpkg@lists.debian.org
   bugs.debian.org/submitter/aj@azure.humbug.org.au
   bugs.debian.org/severity/serious
   bugs.debian.org/tag/lenny-ignore

These should all accept settings like boring=yes, reverse=yes,
repeatmerged=no from cookies, but _shouldn't_ accept any parameters on
the URL. That is, these should be the default views everyone gets and
per-user configuration should be done with cookies.
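
As a rough illustration of that mapping (not the actual smarturl.cgi,
which is Perl), a path-to-CGI resolver could be sketched in Python as
below. Only a few of the URL forms are covered, and every query
parameter other than bug= and pkg=...;dist=unstable is an assumed name:

```python
import re

# Path pattern -> (CGI script, query template). Parameter names such as
# mbox=, src=, version=, severity= and tag= are assumptions.
ROUTES = [
    (r"^/(\d+)$", "bugreport.cgi", "bug={0}"),
    (r"^/(\d+)/mbox$", "bugreport.cgi", "bug={0};mbox=yes"),
    (r"^/package/([^/]+)$", "pkgreport.cgi", "pkg={0};dist=unstable"),
    (r"^/source/([^/]+)$", "pkgreport.cgi", "src={0};dist=unstable"),
    (r"^/package/([^/]+)/([^/]+)$", "pkgreport.cgi", "pkg={0};version={1}"),
    (r"^/severity/([^/]+)$", "pkgreport.cgi", "severity={0}"),
    (r"^/tag/([^/]+)$", "pkgreport.cgi", "tag={0}"),
]

def resolve(url):
    """Return (script path, query string) for a short URL, or None."""
    path, _, query = url.partition("?")
    if query:
        # Short URLs take no parameters; customised views go to /cgi-bin.
        return None
    for pattern, script, template in ROUTES:
        match = re.match(pattern, path)
        if match:
            return "/cgi-bin/" + script, template.format(*match.groups())
    return None

print(resolve("/123456"))
print(resolve("/package/dpkg"))
print(resolve("/123456?reverse=yes"))
```

Per-user settings such as reverse=yes would be read from cookies
inside the CGI itself, so they never appear in this table.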

Only when you want to look at a customised version of a particular
page (like "show me this bugreport reversed") or more complicated
queries ("show me bugs with these three tags set") should you hit
/cgi-bin/pkgreport.cgi URLs.

As such, internal links from bug pages back to package pages and so on
should simply use the smarturl urls above, and not worry about all the
parameter parsing.

At that point, we should make smarturl.cgi active, and only prevent bots
from indexing /cgi-bin afaics.

Does that sound reasonable?

Cheers,
aj

#458939#62
Date:
2008-01-09 20:54:32 UTC
From:
To:
It's far superior to google for searching for results *in* the BTS.
That's obviously the subtext of my statement.

It's not clear that that's actually the right way forward, but it may
be a solution.


Don Armstrong

#458939#67
Date:
2008-01-09 20:56:17 UTC
From:
To:
Right, that's the way to commit it. smarturl.cgi should also actually
be modified to use the right configuration setup and should stop
declaring itself as part of the debbugs packages too. See the other
cgi scripts for examples.


Don Armstrong

#458939#72
Date:
2008-01-10 06:14:06 UTC
From:
To:
Well, that's certainly true while google's directed to not index results
in the BTS; if that weren't the case Google would do a respectable
job. But the more common case (IME) is when you don't have a site:
restriction and you're happy to have external pages appear if they're
relevant (eg, related launchpad bugs, or discussion on lists that didn't
get Cc'ed to the bug, etc). Likewise, the BTS search engine can't take
into account external hints about which bugs are more likely to be
relevant, while Google and other web search engines can.

(In practice, even with google barely indexing anything in the BTS yet,
looking up bug#459818 by googling for `medium dhclient-script' works fine;
using hyperestraier on merkel takes ages and doesn't return any hits.)

It's certainly /a/ solution. What else is an option?

I think any solution should at least provide simple to access
pages for bug#, package and source with at most one subdirectory
(bugs.d.o/bug/123413, eg), and they should be dynamically generated,
which means you've got to have a CGI script determining what to display
based on pathnames. I'd call anything doing that "smarturl", pretty much.

Cheers,
aj

#458939#77
Date:
2008-01-10 06:16:49 UTC
From:
To:
Disallow: /cgi-bin
Allow: /cgi-bin/bugreport.cgi?bug=
Allow: /cgi-bin/pkgreport.cgi?pkg=*;dist=unstable$

would probably work okay, actually.

Cheers,
aj

#458939#82
Date:
2008-01-10 09:55:21 UTC
From:
To:
That's because you actually meant to search for 'medium AND
dhclient-script' not the phrase, medium dhclient-script, which doesn't
appear in the BTS at all.

http://merkel.debian.org/~don/cgi/search.cgi?phrase=medium+AND+dhclient-script&search=search


Don Armstrong

#458939#87
Date:
2008-01-10 09:58:34 UTC
From:
To:
Sorry, I wasn't thinking here at all. The proper method is to use the
update_branch.sh command in the source directory to merge changes
that you've made in your own branch into a copy of the running branch,
and then commit and exit the subshell.

This minimizes the time that intermediate non-compatible changes are
around and also allows you to rapidly revert to a previous commit (as
well as mailing out diffs).


Don Armstrong

#458939#92
Date:
2008-05-18 21:46:05 UTC
From:
To:
So right now Google is allowed to spider bugs.debian.org, but other
search engines are not. That sounds discriminatory.

Perhaps the web server logs could be used to see how much load
Googlebot actually generates?

If the numbers are not very significant, other spiders could be allowed,
couldn't they?
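
A quick way to get such numbers might be a script like the following
sketch. It assumes Apache's "combined" log format and uses request
counts as a crude proxy for load; the two sample lines are invented for
illustration and are not real BTS log data:

```python
import re
from collections import Counter

# In the assumed "combined" format the last quoted field is the User-Agent.
LINE_RE = re.compile(r'"[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"\s*$')

def requests_per_bot(lines, bots=("Googlebot", "msnbot", "Slurp")):
    """Count requests per known bot; everything else is counted as 'other'."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        agent = match.group("agent").lower()
        for bot in bots:
            if bot.lower() in agent:
                counts[bot] += 1
                break
        else:
            counts["other"] += 1
    return counts

# Invented sample lines (documentation IP addresses, made-up sizes).
SAMPLE = [
    '192.0.2.1 - - [18/May/2008:21:46:05 +0000] '
    '"GET /cgi-bin/bugreport.cgi?bug=458939 HTTP/1.1" 200 5123 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.2 - - [18/May/2008:21:46:06 +0000] '
    '"GET /458939 HTTP/1.1" 200 5123 "-" "Mozilla/5.0 (X11; Linux) Firefox"',
]

print(requests_per_bot(SAMPLE))
```

Request counts alone ignore how expensive each CGI call is, so response
times or CPU accounting would give a fuller picture.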


Tomasz Chmielewski
http://blog.wpkg.org

#458939#100
Date:
2015-05-25 20:45:47 UTC
From:
To:
     So, now in 2015, is it still necessary to block some bots and some
URLs or should everything be opened or should this bug be closed or...?
     Just a "ping" :-).
