- Package:
- bugs.debian.org
- Source:
- bugs.debian.org
- Submitter:
- "Jason Spiro"
- Date:
- 2022-06-23 09:27:04 UTC
- Severity:
- wishlist
Package: www.debian.org
Severity: wishlist
Please allow search engines to index http://bugs.debian.org. This can be done by deleting the file http://bugs.debian.org/robots.txt.
Cheers,
OTOH, it is probably best to wait until http://bugs.debian.org/459843
("add email masking to debbugs web interface") is done before you do
anything.
Cheers,
reassign 458939 bugs.debian.org
thanks
Hello,
the right pseudo-package is bugs.debian.org. See http://debian.org/Bugs/pseudo-packages
By the way, I think you should give some reasons for doing what you propose :)
Best regards,
Just for the record, the reason we disallow indexing is that the robots.txt specification isn't complete enough to specify a maximum scan rate for specific portions of the site, so we cannot let bots access the site without degrading performance for its other users. There are already mirrors which allow indexing, and you can use the BTS's own search engine, which is far superior to Google (or any other search engine which doesn't have access to internal metadata) in this regard. Don Armstrong
FWIW I saw this recently: http://en.wikipedia.org/wiki/Wikipedia:Database_download
# Inktomi's "Slurp" can read a minimum delay between hits; if your
# bot supports such a thing using the 'Crawl-delay' or another
# instruction, please let us know.
[...]
## *at least* 1 second please. preferably more :D
## we're disabling this experimentally 11-09-2006
#Crawl-delay: 1
But it may not be widely supported/respected, and you seem to want the crawl rate to vary across the site.
Justin
2008/1/3, Don Armstrong <don@debian.org> wrote: http://en.wikipedia.org/wiki/Robots.txt#Crawl-delay_directive will help. Yahoo and MSNBot both support it. I bet other major bots support it too. So we can allow Yahoo and MSNBot (plus Googlebot, if they support it too) and block everyone else. Using my browser's default search engine is more convenient. :) Also, most users assume that web search engines index everything. They may waste time searching the web before realizing that bugs.d.o is unindexed. Cheers,
Google doesn't, unfortunately. I've already had a bit of a discussion about this particular issue with them. [The other half of the problem is that the Crawl-delay directive is per-bot, and not indexer-wide.] Don Armstrong
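For readers unfamiliar with the directive being discussed, here is a minimal sketch of how a Crawl-delay value is attached to a User-agent group and read back by a well-behaved crawler. It uses Python's standard urllib.robotparser. The robots.txt content, the delay values and the "SomeOtherBot" agent are invented for illustration and are not the BTS's actual configuration. Crawl-delay is a non-standard extension, so a crawler that does not implement it simply ignores the line.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: agents, delays and paths are illustrative only.
ROBOTS_TXT = """\
User-agent: Slurp
Crawl-delay: 10
Disallow: /cgi-bin/

User-agent: msnbot
Crawl-delay: 5
Disallow: /cgi-bin/

User-agent: *
Disallow: /
"""


def load(text):
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = load(ROBOTS_TXT)

# Each delay belongs to one User-agent group; agents that only match
# the "*" group get no delay at all in this example (None).
for agent in ("Slurp", "msnbot", "SomeOtherBot"):
    print(agent, parser.crawl_delay(agent), parser.can_fetch(agent, "/123456"))
```

The delay is only a request: nothing in the file can enforce it, and it cannot be varied per path, which is the limitation raised earlier in this thread.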
Most of the content is generated dynamically nowadays, and this file has been put in place because web crawlers have been known to severely hit the machine hosting the BTS... I'll let BTS admins close the bug or tag it wontfix. Cheers,
2008/1/3, Don Armstrong <don@debian.org> wrote: That is unfortunate. What did they tell you? (Could you forward us their last message?) Though they do provide their Google Webmaster Tools which allow you to adjust the crawl rate to "slower".[1] Does that make it impractical for us to allow robots? Cheers, -j ^[1] http://www.google.com/webmasters/tour/tour4.html
Uh, you're kidding right? The BTS's own search engine won't turn up hits outside the BTS, as a trivial example...
AFAIK it was put in place when we first went dynamic, when bugs.d.o was on master and horribly overloaded (so much so that updating the static pages was taking over half a day). It hasn't been removed ultimately because the CGIs provide too many similar URLs that shouldn't all be indexed; it's definitely a bug that we don't provide some URLs that can be indexed.
Hacking around that in robots.txt seems tricky, as you can only reliably specify Disallow: prefixes in robots.txt. Google supports "*" matches and "$" to match against end of string and Allow: fields, and at least "*" seems somewhat common, so something like this could work:
    Disallow: /*/ # exclude everything but the shortcuts
    Allow: /cgi-bin/bugreport.cgi?bug=
    Allow: /cgi-bin/pkgreport.cgi?pkg=*;dist=unstable$
That doesn't prevent bug=1234;reverse=yes and such, but I can't see a good way of doing that. I've set that up on rietz for Googlebot, we'll see if it works ok. I don't think it's possible to make "Disallow: /*/" be the default for all User-Agents since "*" is an extension, but extending it to MSN and Yahoo should be fine.
Getting smarturl.cgi properly done is still probably the real solution.
Cheers,
aj
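To make the effect of those three rules concrete, here is a rough sketch of wildcard matching with "*" and a trailing "$". The function names pattern_to_regex and is_allowed are hypothetical, and the precedence rule (longest matching pattern wins, ties go to Allow, no match means allowed) is an assumption modelled on how such extensions are commonly documented; it is not taken from this thread and real crawlers may differ.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters; a trailing '$' anchors the end.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(rules, path):
    """Decide whether path may be crawled.

    rules is a list of (kind, pattern) with kind 'allow' or 'disallow'.
    Assumed precedence: longest matching pattern wins, ties favour allow.
    """
    best = None
    for kind, pattern in rules:
        if pattern_to_regex(pattern).match(path):
            key = (len(pattern), kind == "allow")
            if best is None or key > best[0]:
                best = (key, kind)
    return best is None or best[1] == "allow"


# The rule set quoted in the message above.
RULES = [
    ("disallow", "/*/"),
    ("allow", "/cgi-bin/bugreport.cgi?bug="),
    ("allow", "/cgi-bin/pkgreport.cgi?pkg=*;dist=unstable$"),
]

for path in (
    "/459818",
    "/cgi-bin/bugreport.cgi?bug=1234;reverse=yes",
    "/cgi-bin/pkgreport.cgi?pkg=dpkg;dist=unstable",
    "/cgi-bin/pkgreport.cgi?pkg=dpkg;dist=stable",
):
    print(path, is_allowed(RULES, path))
```

Under these assumptions the bug=1234;reverse=yes URL is still crawlable, which is the gap the message points out, while the "$" anchor keeps any pkgreport.cgi URL with extra parameters after dist=unstable out of the index.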
Okay, so I've made smarturl.cgi work again; it was broken by:
- Debbugs::CGI not accepting params from ARGV (smarturl.cgi changed
to set QUERY_STRING)
- Debbugs::CGI, pkgreport.cgi and version.cgi assuming the CGIs are in
the current HTTP path (added "/cgi-bin/")
I've made those changes on rietz directly; what's the procedure
for committing them? "sudo -u debbugs -H bzr commit" ? There was a
pre-existing change in pkgreport.cgi (adding a "^" to the "Go away"
regexp) that also wasn't committed fwiw.
I think the best solution is to deal with URL naming in the long term
as follows:
bugs.debian.org/123456 (bug report)
bugs.debian.org/123456/mbox (full bug mbox format)
bugs.debian.org/123456/10 (individual message)
bugs.debian.org/123456/10/mbox (individual message mbox format)
bugs.debian.org/123456/10/att/3 (attachment to a message)
bugs.debian.org/source/dpkg (bugs against dpkg in unstable)
bugs.debian.org/package/dpkg
bugs.debian.org/source/dpkg/1.14.14 (bugs against dpkg 1.14.14)
bugs.debian.org/package/dpkg/1.14.14
bugs.debian.org/usertag/debian-release@lists.debian.org
bugs.debian.org/usertag/debian-release@lists.debian.org/rc-arm
bugs.debian.org/maint/debian-dpkg@lists.debian.org
bugs.debian.org/submitter/aj@azure.humbug.org.au
bugs.debian.org/severity/serious
bugs.debian.org/tag/lenny-ignore
These should all accept settings like boring=yes, reverse=yes,
repeatmerged=no from cookies, but _shouldn't_ accept any parameters on
the URL. That is, these should be the default views everyone gets and
per-user configuration should be done with cookies.
Only when you want to look at a customised version of a particular
page (like "show me this bugreport reversed") or more complicated
queries ("show me bugs with these three tags set") should you hit
/cgi-bin/pkgreport.cgi URLs.
As such, internal links from bug pages back to package pages and so on
should simply use the smarturl urls above, and not worry about all the
parameter parsing.
At that point, we should make smarturl.cgi active, and only prevent bots
from indexing /cgi-bin afaics.
Does that sound reasonable?
Cheers,
aj
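As an illustration of the URL scheme proposed above, here is a small sketch of a path dispatcher. It is not how smarturl.cgi is implemented (that is a Perl CGI script whose code is not shown in this thread); parse_smart_path, the helper _parse_bug and the dictionary keys are hypothetical names chosen for the example. Following the proposal, any path carrying URL parameters is rejected, since customised views would go through /cgi-bin instead.

```python
def _parse_bug(bug, rest):
    """Handle /<bug>[/<msg>][/mbox] and /<bug>/<msg>/att/<n>."""
    view = {"view": "bug", "bug": bug}
    if rest and rest[0].isdigit():
        view["msg"] = int(rest[0])
        rest = rest[1:]
        if len(rest) == 2 and rest[0] == "att" and rest[1].isdigit():
            view["att"] = int(rest[1])
            return view
    if rest == ["mbox"]:
        view["format"] = "mbox"
        return view
    return view if not rest else None


def parse_smart_path(path):
    """Map a path from the proposed scheme to a view description.

    Returns None for anything outside the scheme, including any path
    that carries a query string.
    """
    if "?" in path:
        return None
    parts = [p for p in path.strip("/").split("/") if p]
    if not parts:
        return None
    head, rest = parts[0], parts[1:]
    if head.isdigit():
        return _parse_bug(int(head), rest)
    if head in ("source", "package") and 1 <= len(rest) <= 2:
        view = {"view": head, "name": rest[0]}
        if len(rest) == 2:
            view["version"] = rest[1]
        return view
    if head == "usertag" and 1 <= len(rest) <= 2:
        view = {"view": "usertag", "user": rest[0]}
        if len(rest) == 2:
            view["tag"] = rest[1]
        return view
    if head in ("maint", "submitter", "severity", "tag") and len(rest) == 1:
        return {"view": head, "value": rest[0]}
    return None


print(parse_smart_path("/123456/10/att/3"))
print(parse_smart_path("/source/dpkg/1.14.14"))
print(parse_smart_path("/123456?reverse=yes"))
```

Keeping the parameter-free paths as the only indexable form means robots.txt would need just one rule excluding /cgi-bin, as the message suggests.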
It's far superior to google for searching for results *in* the BTS. That's obviously the subtext of my statement. It's not clear that that's actually the right way forward, but it may be a solution. Don Armstrong
Right, that's the way to commit it. smarturl.cgi should also actually be modified to use the right configuration setup and should stop declaring itself as part of the debbugs packages too. See the other cgi scripts for examples. Don Armstrong
Well, that's certainly true while Google's directed to not index results in the BTS; if that weren't the case Google would do a respectable job. But the more common case (IME) is when you don't have a site: restriction and you're happy to have external pages appear if they're relevant (eg, related launchpad bugs, or discussion on lists that didn't get Cc'ed to the bug, etc). Likewise, the BTS search engine can't take into account external hints about which bugs are more likely to be relevant, while Google and other web search engines can.
(In practice, even with Google barely indexing anything in the BTS yet, looking up bug#459818 by googling for `medium dhclient-script' works fine; using hyperstraier on merkel takes ages and doesn't return any hits.)
It's certainly /a/ solution. What else is an option? I think any solution should at least provide simple to access pages for bug#, package and source with at most one subdirectory (bugs.d.o/bug/123413, eg), and they should be dynamically generated, which means you've got to have a CGI script determining what to display based on pathnames. I'd call anything doing that "smarturl", pretty much.
Cheers,
aj
    Disallow: /cgi-bin
    Allow: /cgi-bin/bugreport.cgi?bug=
    Allow: /cgi-bin/pkgreport.cgi?pkg=*;dist=unstable$
would probably work okay, actually.
Cheers,
aj
That's because you actually meant to search for 'medium AND dhclient-script' not the phrase, medium dhclient-script, which doesn't appear in the BTS at all. http://merkel.debian.org/~don/cgi/search.cgi?phrase=medium+AND+dhclient-script&search=search Don Armstrong
Sorry, I wasn't thinking here at all. The proper method is to use the update_branch.sh command in the source directory to merge in changes that you've made in your own branch into a copy of the running branch, and then commit and exit the subshell. This minimizes the time that intermediate non-compatible changes are around and also allows you to rapidly revert to a previous commit (as well as mailing out diffs). Don Armstrong
So right now Google is allowed to spider bugs.debian.org, but other search engines are not. That sounds discriminatory. Perhaps the web server logs could show how much load Googlebot generates? If the numbers are not very significant, other spiders could be allowed, couldn't they? Tomasz Chmielewski http://blog.wpkg.org
So, now in 2015, is it still necessary to block some bots and some
URLs or should everything be opened or should this bug be closed or...?
Just a "ping" :-).