GNOME Bugzilla – Bug 142505
Invite google to index bugs
Last modified: 2018-06-16 13:27:45 UTC
Somehow Google should be persuaded to index the bug reports. I imagine this could be done by making a "?"-free URL for bugs and a set of similarly "?"-free index pages.
I seem to recall that we had some problems with google beating up the machine rather severely last time we let it talk to bugzilla. The new hardware may eliminate/reduce this issue, but I don't think anybody has looked into it.
Why is this useful, out of curiosity?
It's useful because Google is far better at general searching than Bugzilla's query.
Hum. I'm wondering if it might be simpler to make a "Simple search box" on the main page like modern b.m.o does, since that's all google would be doing. Not adding to google would also help wrt keeping loads of e-mail addresses out of google - I'm fairly sure people use google to find spam targets.
Calling google a simple text search is absurd. They spend a good deal of effort on ordering hits well, and they by and large do a fine job. I don't see a "simple search box" solving the issue of finding things. (And how would anyone but google themselves find email addresses from it?)
Google would help greatly. Then the query and search pages could have a link that would do a Google search. This would be far better than the current search ability. The nice thing about a Google search is that it works using all of the data. Google won't care if keywords are in the summary, initial report, comments, comment headers, bug attributes, or wherever. Google just does the search.
Bugzilla could provide added value to Google because I use it to search for general bugs in my software. Often I have later found a bug report in Bugzilla that talks about exactly this problem. Another problem is that when searching through Bugzilla I often have to include all bugs (including closed ones), because I'm using an older version of the software and just want to know if that issue is already fixed and/or can be worked around. For such queries Google is waaaaay faster.
FWIW, Greg's comment is correct - Google brought Bugzilla to its knees regularly until we figured out what the problem was and updated robots.txt. This would only work if you did a regular static dump of the bugs somewhere and encouraged Google to index that. [FWIW, I'm not really clear on the utility of this, given the inevitable lag, etc., etc., but I'm not sure I see much harm either, as long as we strongly encourage bug hunters to search locally as well as remotely.] [Hrm... alternately, maybe we could talk Google into giving us a http://www.google.com/enterprise/gsa/ ;)]
If we do this, I'd like to defang email addresses and remove the mailto links. Actually, I think I'd like to do that anyway unless people can think of a good reason not to...
I don't find them very useful very often, but when I need them, they are irreplaceable. Maybe only show them if you are logged in, which presumably spammers wouldn't do/be able to do?
Redhat does what Luis describes. I'd like to do that when we upgrade.
That one should /so/ be an upstream default :)
Hrm... not one mid-air collision but two. Don't know if my comment is useful anymore, but it may be, so I'll post it anyway: sounds reasonable... What about bug-buddy reporters without an account, though? That is added as part of a longdesc, so there are a couple of options: (1) parse all longdescs on the fly looking for email-like strings and replace the @'s and .'s with spaces, (2) change the message added to the initial long description to add the reporter's email without the @'s and .'s, (3) do (2) but also add an extra mailto field somewhere for those that are logged in.
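Option (1) above could be prototyped with a simple substitution. This is a rough sketch, not production code: the regex is a deliberately loose assumption about what counts as "email-like", and `defang` is a hypothetical helper name.

```python
import re

# Loose pattern for email-like strings; intentionally simple for illustration.
EMAIL_RE = re.compile(r'\b([\w.+-]+)@([\w-]+(?:\.[\w-]+)+)\b')

def defang(text):
    """Replace the '@' and '.' of email-like strings with spaces,
    as suggested in option (1) above."""
    def repl(m):
        local, domain = m.group(1), m.group(2)
        return local.replace('.', ' ') + ' ' + domain.replace('.', ' ')
    return EMAIL_RE.sub(repl, text)
```

Run over each longdesc before the page is served, this would keep harvestable addresses out of any indexed copy while leaving the rest of the comment intact.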
The main things you can do to reduce the load Google puts on a website are to send correct caching headers with the response, and to short-circuit processing if the client does a validity check and nothing has changed. The main headers that should be sent back include:

* ETag - an identifier for a particular version of a page that can be used to validate that the page is up-to-date. The last-changed-date combined with some info from the cookies could probably be used as a weak validator here (it isn't a strong validator, because the Bugzilla templates could change, or a referenced bug could change).
* Last-Modified - used for response validation in HTTP/1.0 clients. Also used by clients to guess expiry dates if one isn't provided. Must be strictly increasing as changes are made. Again, the last-changed-date should work here.
* Expires - specifies a date when the cached copy needs to be revalidated. Could probably use some heuristics based on the last-changed-date (if a bug has been changed recently, it will probably change again soon, and vice versa).
* "Cache-Control: max-age=NNNN" - similar to the above, but you specify the age in seconds. Overrides the Expires header in HTTP/1.1 clients.
* "Cache-Control: public" - makes sure that pages are cacheable for unauthenticated connections. For authenticated clients, use "Cache-Control: private", so that shared caches don't hold onto them.
* "Vary: Cookie" - since cookie auth is being used, indicates that the response will change depending on the value of the cookies.

If the client sends an If-Modified-Since or If-None-Match header, the Bugzilla code should determine as early as possible whether the page has changed, and if it hasn't, write out a "304 Not Modified" response immediately (which probably still needs the database connection). Getting all this right is a pretty big job, but it would probably result in a performance boost for normal users too (the browser would be able to cache more responses).
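As an illustration of the validation logic described above (in Python rather than Bugzilla's actual Perl; the function names and the weak-ETag scheme are assumptions, not anything Bugzilla ships):

```python
from email.utils import formatdate, parsedate_to_datetime

def make_etag(bug_id, last_changed_ts, template_version):
    # Weak validator: combines the bug's last-changed time with a
    # template version, since a template change also changes the page.
    return 'W/"%s-%s-%s"' % (bug_id, int(last_changed_ts), template_version)

def response_headers(bug_id, last_changed_ts, template_version, logged_in):
    """Build the caching headers discussed above."""
    return {
        "ETag": make_etag(bug_id, last_changed_ts, template_version),
        "Last-Modified": formatdate(last_changed_ts, usegmt=True),
        # Shared caches may keep pages for anonymous visitors only.
        "Cache-Control": "private" if logged_in else "public, max-age=3600",
        # The rendered page varies with the login cookie.
        "Vary": "Cookie",
    }

def not_modified(request_headers, bug_id, last_changed_ts, template_version):
    """True if we can short-circuit with a 304 Not Modified response."""
    etag = make_etag(bug_id, last_changed_ts, template_version)
    if request_headers.get("If-None-Match") == etag:
        return True
    ims = request_headers.get("If-Modified-Since")
    if ims:
        try:
            return parsedate_to_datetime(ims).timestamp() >= last_changed_ts
        except (TypeError, ValueError):
            return False
    return False
```

The key point is that `not_modified` needs only the bug's last-changed timestamp from the database, so the expensive template rendering can be skipped entirely on a cache hit.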
Has anyone considered the new Google Sitemap API? http://www.google.com/webmasters/sitemaps/docs/en/about.html If I'm not mistaken, it is an API that allows site managers to explicitly update Google's index without needing the actual crawler to sweep your entire content.
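A Sitemap, as defined by the protocol that page describes, is just an XML file listing URLs. A minimal sketch of generating one for bug pages (a hypothetical helper, not part of Bugzilla) might look like:

```python
from xml.etree import ElementTree as ET

def build_sitemap(bug_ids, base="https://bugzilla.gnome.org/show_bug.cgi?id="):
    # Minimal Sitemap protocol document: one <url><loc> entry per bug.
    ns = "http://www.sitemaps.org/schemas/sitemap/0.9"
    urlset = ET.Element("urlset", xmlns=ns)
    for bug_id in bug_ids:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = base + str(bug_id)
    return ET.tostring(urlset, encoding="unicode")
```

A nightly cron job could regenerate this from the bug table and let Google fetch it, so the crawler never has to discover bug URLs by spidering.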
I'm opposed to Google indexing Bugzilla. So far I had made my objections known only on IRC, but I've been invited to share my concerns here on the bug.

First, as a current bz user, I know bz isn't indexed, so I have made my comments with the understanding that they will not end up on Google. That means that any future Google indexing must not cover existing bugs and comments, but only ones created after indexing was turned on.

I have two main reasons to oppose indexing: submitter privacy, and my online reputation.

* Privacy: Submitters may not realise their bug reports will end up being world-readable and, worse, Google-indexed. Bug reports, esp. crashes, may contain sensitive private information, e.g. passwords, names etc., but also the infamous pr0n filenames and URLs in totem crash reports. By the time they realise and try to get the private info removed, Google may already have indexed it.
* Online reputation: Bug reports often contain inappropriate text. Examples include the already-cited pr0n URLs and filenames, but also the answers to the "what were you doing when the programme crashed" question, which vary from X-rated phrases/activities to descriptions of clearly illegal conduct the submitter has engaged in. By allowing the bug report to be indexed by Google, I would be linked to that in Google searches. That is completely unacceptable to me.

So how could one go about designing indexing in a way that addresses these concerns?

* New bugs start as do-not-index. Only if the submitter has checked the "allow google indexing" box in bug-buddy, and a triager has checked the report for sensitive text and set the allow-indexing flag, is the bug allowed to be indexed.
* Have an "allow indexing my comments" opt-in checkbox in the bz user prefs page. If any commenter on a bug has not checked it, don't allow indexing (if such a comment is only added after indexing, don't allow re-indexing).
Note that it's not enough to just remove these specific comments in the page as served to the google indexer, but the whole bug from that point on, since later commenters may cite part of that previous comment, or address their comment to the previous commenter using his bugzilla name or email address.
I forgot to make clear that the link in my 2nd reason would be from me commenting on those bug reports, or triaging them, fixing them, or just plain bz maintenance work like re-assigning, mass-closing NEEDINFO bugs, target milestone updating etc.
A couple of things bear mentioning here.

1. The search function in Bugzilla is mediocre at best. No wonder we have so many duplicates - there's no good way to search for them except the Bugzilla advanced search (which is crazy complicated). Allowing searches in Google could only help this problem.

2. With regards to the issues with Google bringing Bugzilla to its knees, I'd have to see it to believe it. I'd bet that the indexing will be huge, especially at first, but that the savings in search overhead will be huge too. For example, go do a search for something complicated using the simple search tool. I just did "Open with", and it took about three minutes to come back blank. Google would have taken less than a second. Literally. In any case, if this is really a problem, you can configure the crawl rate of Google (http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=48620).

3. I looked at the robots.txt. Is this really necessary:

   User-agent: *
   Allow: /index.cgi
   Disallow: /
   # Any robot like behaviour against this site is not allowed
   # This includes Wget, etc
   #
   # If you do want to do this, please contact bugmaster@gnome.org
   # with your contact details so we can start legal action
   # against you

4. robots.txt is OPTIONAL! If we are posting information that is private, we shouldn't be. robots.txt is an awful way to protect data.

5. With that in mind, we should be hiding email addresses, having a contact form, and opening the site to Google. Or something like that.

I say all this to get conversation going!
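For comparison, a robots.txt that opened up only individual bug pages rather than the whole site might look like the sketch below. This is illustrative, not the actual file; note that Googlebot ignores the Crawl-delay directive, so its rate would have to be set through Google's webmaster tools instead.

```
# Hypothetical robots.txt: allow crawling of bug pages only.
User-agent: *
Allow: /show_bug.cgi
Allow: /index.cgi
Disallow: /
# Crawl-delay is honoured by some crawlers (not Googlebot);
# Googlebot's rate is configured via Google's webmaster tools.
Crawl-delay: 10
```

This keeps the expensive query pages (buglist.cgi, the advanced search) off-limits while still letting searches land on individual reports.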
One other thing that occurred to me is that it is EXTREMELY useful to find a relevant bug report when you are having problems with software. For example, if something isn't working the way I think it should be, I google for a solution. Sometimes, I find that somebody has linked to a bug report here, and that is the "solution" to my problem. Other times, months later I'll discover that the problem I couldn't solve was in fact a bug that had been filed long ago, but which google couldn't find! Wouldn't it be better if bug reports came up when you googled for problems? I think so, personally. Then you'd know to subscribe to the bug, and that the developers know about the problem. Imagine this to the extreme, if, whenever you had an issue with a program, you had to find its bug tracker and see if your issue is a bug. Isn't it easier just to use Google?
Regarding comment 18:

2. Waiting for bad stuff to happen is not how I allow bgo to be treated. In the past, Google made bgo inaccessible.

4. robots.txt is basic net behaviour. Its contents refer to crawling the site. That you make it about privacy is irrelevant.

That said, you didn't address the only objection that I have: chpe doesn't want it.
About number 2, I'd say go for it with the crawl rate configured as Google recommends. It was apparently four years ago that Google brought it down. Perhaps things would be better now? I'd say that's worth testing. Anyway, in 2004 there was new hardware that could maybe handle it, right?

About number 4, yeah, it's true that robots.txt is basic net behavior, but in a couple of places in this bug (e.g. comment 4 and comment 16) people have mentioned that they don't want the site indexed because of privacy and security reasons. That sounds like a problem, since the kinds of people that would be looking for passwords, email addresses and the like are the same people that would ignore a robots.txt file. To me, this isn't a good argument for robots.txt so much as a bad argument for why it's OK to put confidential information online.

About chpe's objections, those are a bit more tricky. He mentioned two points, but it sounds like the heart of the problem is bug-buddy, which certainly is a problem that should be addressed. If it's asking people for confidential information and posting that information to the Internet, that's a HUGE bug, and robots.txt is like a leaky valve on a fire hose. Comment 13 seems to have a good point that if this is the case, we should parse out the email addresses on the fly. I tried just now to see if this is filed as a bug... but the search tool failed me and returned a blank page.

To chpe's objection about being associated with p0rn and such in Google, I'd say that he's already associated with it in Bugzilla, and that if he really wanted not to be associated with something on the Internet, he shouldn't have _associated_ with it on the Internet. For comments where he is associated by triage or other comments, it shouldn't be too much of a concern to his reputation.
If he is doing something in Bugzilla that would tarnish his reputation, then I'd say he shouldn't have done those actions without donning the cloak of anonymity that the Internet provides (I'm guessing this isn't the case). So I don't have much of a solution there, obviously, and maybe I don't understand the problem completely. He does have more creds around here than I do, but being indexed by search engines is basic net behavior too, or so I thought. Hmmm... complicated. One other solution could be to give people time to change their login and personal information so they can anonymize it before any indexing happens.
I just have one more comment here, just to document how frustrating the search within bugzilla is. I just wanted to search for bugs having to do with evince and crash handling. In order to cover the bases, I had to search for "crash handle," "crash handler," "crash handling," and all down the line. It's frustrating as a new bug reporter that finding bugs is so challenging. The problem is that you want to behave well and not file a duplicate, but it's really hard to find duplicates in the first place.
I'm fully for this, and I think it would be tremendously useful to tons and tons of people who are searching for information about GNOME issues. We would only allow indexing show_bug.cgi, so basically Google would only see bugs linked from external parties. I'm pretty sure that the current Bugzilla could now handle being indexed by Google. If there are load problems, we can disable crawling fairly easily when they happen. I understand chpe's concern, but with some analysis, I don't think there is a serious personal privacy issue with allowing Google to index Bugzilla. Individual commenter names won't have a very high pagerank, I suspect, and once we add back in describeuser.cgi, the pagerank would go to the describeuser.cgi page anyhow, not any particular bug page. In any case, bugs get a high pagerank only if they're frequently linked from outside or from other bugs. Also, the excerpts that Google shows are the ones close to your name, which would be content you created yourself.
Anybody still investigating this? The only reasonable way I find bugs on bugzilla.gnome.org these days is by googling for a relevant bug on Launchpad or on bugzilla.redhat.com, and hoping somebody's tied it to the correct bug here, which is a pretty frustrating way to find the upstream report.
I see lots and lots and lots of discussions in forums everywhere about bugs that have long been reported and analyzed. But the users and forum regulars involved in those discussions seem to be unaware of the bug reports. And I guess that's because none of them turn up in the searches they do.
Regarding indexing per se, Bugzilla is partially indexed: https://www.google.com/search?q=site%3Abugzilla.gnome.org+gnome

However, those search results are not really useful unless you tell Google not to filter the "non-relevant" results, i.e. https://www.google.com/search?q=site%3Abugzilla.gnome.org+gnome&filter=0

Maybe it would make sense to help search engines show the really relevant results - links to bugs with their summary - instead. DuckDuckGo is a bit better, i.e. it doesn't filter that aggressively: https://duckduckgo.com/?q=site%3Abugzilla.gnome.org+gnome

If server load is still an issue: a few hours ago on IRC (#sysadmin), I suggested that a less invasive static mirror of Bugzilla for indexing purposes could IMO be maintained with very basic tools (a search that simply returns all bugs changed in the past hour, a cron job, and a wget limited to show_bug.cgi with recursive depth set to 1).

Regarding privacy: if any sensitive data is in publicly accessible bugs now, then that is the problem. It is not safe just because fewer people can access it - all the wrong people have it already.
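The static-mirror idea above can be sketched in a few lines of Python. This is an assumption-laden illustration: the list of changed bug IDs would come from the "changed in the past hour" search, which is not shown, and `bug_url`/`mirror_bugs` are hypothetical helper names.

```python
import urllib.request
from pathlib import Path

BASE = "https://bugzilla.gnome.org/show_bug.cgi?id="

def bug_url(bug_id):
    # Build the canonical show_bug.cgi URL for a bug.
    return BASE + str(bug_id)

def mirror_bugs(bug_ids, outdir="mirror"):
    """Fetch each changed bug page and store it as static HTML.
    A cron job would call this hourly with the IDs of bugs changed
    since the last run, giving crawlers a cheap static copy to index."""
    Path(outdir).mkdir(exist_ok=True)
    for bug_id in bug_ids:
        html = urllib.request.urlopen(bug_url(bug_id)).read()
        Path(outdir, "%d.html" % bug_id).write_bytes(html)
```

The mirror takes the crawler load entirely off the CGI scripts, at the cost of up to an hour of lag in the indexed copy.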
I have just allowed Googlebot access to /show_bug.cgi, and will be watching the load on our server over the coming days. This way I hope to get some information on whether or not our current setup is able to handle it, and whether the crawler hits our rate-limiting system. Please note that if I notice the load rising too high, I will disable this again, but we will at least have some data on it. Regarding privacy: as long as Google is not logged in (and it isn't), it will only see the names of the people commenting on the bug, but no email addresses.
My concerns (detailed in comment 16) aren't addressed by that at all. So why did you go ahead with this regardless?
I welcome this move. Your concerns can be addressed by marking comments as private. Also: We are (and were) quite clear about the fact that this is an *open* bug tracker and that "activity on most bugs, including email addresses, will be visible to the public" cf. https://bugzilla.gnome.org/createaccount.cgi If you have any further points to make, I guess upstream bugzilla is the more appropriate place.
Google is now indexing searches. I am not certain that is useful. Here's a sample I see: https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=6&ved=0CEwQFjAF&url=https%3A%2F%2Fbugzilla.gnome.org%2Fbuglist.cgi%3Fbug_status%3DUNCONFIRMED%26bug_status%3DNEW%26bug_status%3DASSIGNED%26bug_status%3DREOPENED%26priority%3DNormal%26product%3Dgtk%252B%26query_format%3Dadvanced%26order%3Dbug_severity%252Cbug_id%26query_based_on%3D&ei=HOZcU5y5IobnsAS8jYCACg&usg=AFQjCNE2Ib_I24TkZ8dSYD62LcQdRIiIAQ&sig2=WE-hnKa67Ykuw433AE9DuQ&cad=rja I don't see any evidence that the actual bug reports show up in google results.
Patrick: Any updates regarding comment 27 and comment 30?
After https://wiki.gnome.org/Initiatives/DevelopmentInfrastructure , GNOME is moving its task tracking from Bugzilla to GitLab at https://gitlab.gnome.org/ as previously announced in https://mail.gnome.org/archives/desktop-devel-list/2018-May/msg00026.html . See https://wiki.gnome.org/GitLab for more information. Hence closing this ticket as WONTFIX: There are no plans to work on Bugzilla.