GNOME Bugzilla – Bug 142505
Invite google to index bugs
Last modified: 2018-06-16 13:27:45 UTC
Somehow Google should be persuaded to index the bug reports. I imagine this could be done by making a "?"-free URL for bugs and a set of similarly "?"-free index pages.
I seem to recall that we had some problems with google beating up the machine rather severely last time we let it talk to bugzilla. The new hardware may eliminate/reduce this issue, but I don't think anybody has looked into it.
Why is this useful, out of curiosity?
It's useful because Google is far better at general searching than Bugzilla's query.
Hum. I'm wondering if it might be simpler to make a "Simple search box" on the main page like modern b.m.o does, since that's all google would be doing. Not adding to google would also help wrt keeping loads of e-mail addresses out of google - I'm fairly sure people use google to find spam targets.
Calling google a simple text search is absurd. They spend a good deal of effort on ordering hits well, and they by and large do a fine job. I don't see a "simple search box" solving the issue of finding things. (And how would anyone but google themselves find email addresses from it?)
Google would help greatly. Then the query and search pages could have a link that would do a Google search. This would be far better than the current search ability. The nice thing about a Google search is that it works using all of the data. Google won't care if keywords are in the summary, initial report, comments, comment headers, bug attributes, or wherever. Google just does the search.
Bugzilla could provide added value to Google because I use it to search for general bugs in my software. Often I have later found a bug report in Bugzilla that talks about exactly this problem. Another problem is that when searching through Bugzilla I often have to include all bugs (including closed ones), because I'm using an older version of the software and just want to know if that issue is already fixed and/or can be worked around. For such queries Google is waaaaay faster.
FWIW, Greg's comment is correct - Google brought Bugzilla to its knees regularly until we figured out what the problem was and updated robots.txt. This would only work if you did a regular static dump of the bugs somewhere and encouraged Google to index that. [FWIW, I'm not really clear on the utility of this, given the inevitable lag, etc., etc., but I'm not sure I see much harm either, as long as we strongly encourage bug hunters to search locally as well as remotely.] [Hrm... alternately, maybe we could talk Google into giving us a http://www.google.com/enterprise/gsa/ ;)]
If we do this, I'd like to defang email addresses and remove the mailto links. Actually, I think I'd like to do that anyway unless people can think of a good reason not to...
I don't find them very useful very often, but when I need them, they are irreplaceable. Maybe only show them if you are logged in, which presumably spammers wouldn't do/be able to do?
Redhat does what Luis describes. I'd like to do that when we upgrade.
That one should /so/ be an upstream default :)
Hrm... not one mid-air collision but two. Don't know if my comment is useful anymore, but it may be, so I'll post it anyway: sounds reasonable... What about bug-buddy reporters without an account, though? That is added as part of a longdesc, so there are a couple of options: (1) parse all longdescs on the fly looking for email-like strings and replace the @'s and .'s with spaces, (2) change the message added to the initial long description to add the reporter's email without the @'s and .'s, (3) do (2) but also add an extra mailto field somewhere for those that are logged in.
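Option (1) above could be prototyped with a simple substitution. This is a rough sketch, not production code: the regex is a deliberately loose assumption about what counts as "email-like", and `defang` is a hypothetical helper name.

```python
import re

# Loose pattern for email-like strings; intentionally simple for illustration.
EMAIL_RE = re.compile(r'\b([\w.+-]+)@([\w-]+(?:\.[\w-]+)+)\b')

def defang(text):
    """Replace the '@' and '.' of email-like strings with spaces,
    as suggested in option (1) above."""
    def repl(m):
        local, domain = m.group(1), m.group(2)
        return local.replace('.', ' ') + ' ' + domain.replace('.', ' ')
    return EMAIL_RE.sub(repl, text)
```

Run over each longdesc before the page is served, this would keep harvestable addresses out of any indexed copy while leaving the rest of the comment intact.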
The main things you can do to reduce the load Google puts on a website are to send correct caching headers with the response, and to short-circuit processing if the client does a validity check and nothing has changed. The main headers that should be sent back include:

* ETag - an identifier for a particular version of a page that can be used to validate that the page is up-to-date. The last-changed-date combined with some info from the cookies could probably be used as a weak validator here (it isn't a strong validator, because the Bugzilla templates could change, or a referenced bug could change).
* Last-Modified - used for response validation in HTTP/1.0 clients. Also used by clients to guess expiry dates if one isn't provided. Must be strictly increasing as changes are made. Again, the last-changed-date should work here.
* Expires - specifies a date when the cached copy needs to be revalidated. Could probably use some heuristics based on the last-changed-date (if a bug has been changed recently, it will probably change again soon, and vice versa).
* "Cache-Control: max-age=NNNN" - similar to the above, but you specify the age in seconds. Overrides the Expires header in HTTP/1.1 clients.
* "Cache-Control: public" - makes sure that pages are cacheable for unauthenticated connections. For authenticated clients, use "Cache-Control: private", so that shared caches don't hold onto them.
* "Vary: Cookie" - since cookie auth is being used, indicates that the response will change depending on the value of the cookies.

If the client sends an If-Modified-Since or If-None-Match header, the Bugzilla code should determine as early as possible whether the page has changed, and if it hasn't, write out a "304 Not Modified" response immediately (which probably still needs the database connection). Getting all this right is a pretty big job, but it would probably result in a performance boost for normal users too (the browser would be able to cache more responses).
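As an illustration of the validation logic described above (in Python rather than Bugzilla's actual Perl; the function names and the weak-ETag scheme are assumptions, not anything Bugzilla ships):

```python
from email.utils import formatdate, parsedate_to_datetime

def make_etag(bug_id, last_changed_ts, template_version):
    # Weak validator: combines the bug's last-changed time with a
    # template version, since a template change also changes the page.
    return 'W/"%s-%s-%s"' % (bug_id, int(last_changed_ts), template_version)

def response_headers(bug_id, last_changed_ts, template_version, logged_in):
    """Build the caching headers discussed above."""
    return {
        "ETag": make_etag(bug_id, last_changed_ts, template_version),
        "Last-Modified": formatdate(last_changed_ts, usegmt=True),
        # Shared caches may keep pages for anonymous visitors only.
        "Cache-Control": "private" if logged_in else "public, max-age=3600",
        # The rendered page varies with the login cookie.
        "Vary": "Cookie",
    }

def not_modified(request_headers, bug_id, last_changed_ts, template_version):
    """True if we can short-circuit with a 304 Not Modified response."""
    etag = make_etag(bug_id, last_changed_ts, template_version)
    if request_headers.get("If-None-Match") == etag:
        return True
    ims = request_headers.get("If-Modified-Since")
    if ims:
        try:
            return parsedate_to_datetime(ims).timestamp() >= last_changed_ts
        except (TypeError, ValueError):
            return False
    return False
```

The key point is that `not_modified` needs only the bug's last-changed timestamp from the database, so the expensive template rendering can be skipped entirely on a cache hit.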
Has anyone considered the new Google Sitemap API? http://www.google.com/webmasters/sitemaps/docs/en/about.html If I'm not mistaken, it is an API that allows site managers to explicitly update Google's index without needing the actual crawler to sweep your entire content.
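A Sitemap, as defined by the protocol that page describes, is just an XML file listing URLs. A minimal sketch of generating one for bug pages (a hypothetical helper, not part of Bugzilla) might look like:

```python
from xml.etree import ElementTree as ET

def build_sitemap(bug_ids, base="https://bugzilla.gnome.org/show_bug.cgi?id="):
    # Minimal Sitemap protocol document: one <url><loc> entry per bug.
    ns = "http://www.sitemaps.org/schemas/sitemap/0.9"
    urlset = ET.Element("urlset", xmlns=ns)
    for bug_id in bug_ids:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = base + str(bug_id)
    return ET.tostring(urlset, encoding="unicode")
```

A nightly cron job could regenerate this from the bug table and let Google fetch it, so the crawler never has to discover bug URLs by spidering.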
I'm opposed to Google indexing Bugzilla. So far I had made my objections known only on IRC, but I've been invited to share my concerns here on the bug.

First, as a current bz user, I know bz isn't indexed, so I have made my comments with the understanding that they will not end up on Google. That means that any future Google indexing must not cover existing bugs and comments, but only ones created after indexing was turned on.

I have two main reasons to oppose indexing: submitter privacy, and my online reputation.

* Privacy: Submitters may not realise their bug reports will end up being world-readable and, worse, Google-indexed. Bug reports, esp. crashes, may contain sensitive private information, e.g. passwords, names etc., but also the infamous pr0n filenames and URLs in totem crash reports. By the time they realise and try to get the private info removed, Google may already have indexed it.
* Online reputation: Bug reports often contain inappropriate text. Examples include the already-cited pr0n URLs and filenames, but also the answers to the "what were you doing when the programme crashed" question, which vary from X-rated phrases/activities to descriptions of clearly illegal conduct the submitter has engaged in. By allowing the bug report to be indexed by Google, I would be linked to that in Google searches. That is completely unacceptable to me.

So how could one go about designing indexing in a way that addresses these concerns?

* New bugs start as do-not-index. Only if the submitter has checked the "allow google indexing" box in bug-buddy, and a triager has checked the report for sensitive text and set the allow-indexing flag, is the bug allowed to be indexed.
* Have an "allow indexing my comments" opt-in checkbox in the bz user prefs page. If any commenter on a bug has not checked it, don't allow indexing (if such a comment is only added after indexing, don't allow re-indexing).
Note that it's not enough to just remove these specific comments in the page as served to the google indexer, but the whole bug from that point on, since later commenters may cite part of that previous comment, or address their comment to the previous commenter using his bugzilla name or email address.
I forgot to make clear that the link in my 2nd reason would be from me commenting on those bug reports, or triaging them, fixing them, or just plain bz maintenance work like re-assigning, mass-closing NEEDINFO bugs, target milestone updating etc.
A couple of things bear mentioning here.

1. The search function in Bugzilla is mediocre at best. No wonder we have so many duplicates - there's no good way to search for them except the Bugzilla advanced search (which is crazy complicated). Allowing searches in Google could only help this problem.

2. With regards to the issues with Google bringing Bugzilla to its knees, I'd have to see it to believe it. I'd bet that the indexing will be huge, especially at first, but that the savings in search overhead will be huge too. For example, go do a search for something complicated using the simple search tool. I just did "Open with", and it took about three minutes to come back blank. Google would have taken less than a second. Literally. In any case, if this is really a problem, you can configure the crawl rate of Google (http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=48620).

3. I looked at the robots.txt. Is this really necessary:

   User-agent: *
   Allow: /index.cgi
   Disallow: /
   # Any robot like behaviour against this site is not allowed
   # This includes Wget, etc
   #
   # If you do want to do this, please contact bugmaster@gnome.org
   # with your contact details so we can start legal action
   # against you

4. robots.txt is OPTIONAL! If we are posting information that is private, we shouldn't be. robots.txt is an awful way to protect data.

5. With that in mind, we should be hiding email addresses, having a contact form, and opening the site to Google. Or something like that.

I say all this to get conversation going!
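For comparison, a robots.txt that opened up only individual bug pages rather than the whole site might look like the sketch below. This is illustrative, not the actual file; note that Googlebot ignores the Crawl-delay directive, so its rate would have to be set through Google's webmaster tools instead.

```
# Hypothetical robots.txt: allow crawling of bug pages only.
User-agent: *
Allow: /show_bug.cgi
Allow: /index.cgi
Disallow: /
# Crawl-delay is honoured by some crawlers (not Googlebot);
# Googlebot's rate is configured via Google's webmaster tools.
Crawl-delay: 10
```

This keeps the expensive query pages (buglist.cgi, the advanced search) off-limits while still letting searches land on individual reports.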
One other thing that occurred to me is that it is EXTREMELY useful to find a relevant bug report when you are having problems with software. For example, if something isn't working the way I think it should be, I google for a solution. Sometimes, I find that somebody has linked to a bug report here, and that is the "solution" to my problem. Other times, months later I'll discover that the problem I couldn't solve was in fact a bug that had been filed long ago, but which google couldn't find! Wouldn't it be better if bug reports came up when you googled for problems? I think so, personally. Then you'd know to subscribe to the bug, and that the developers know about the problem. Imagine this to the extreme, if, whenever you had an issue with a program, you had to find its bug tracker and see if your issue is a bug. Isn't it easier just to use Google?
Regarding comment 18:

2. Waiting for bad stuff to happen is not how I allow bgo to be treated. In the past, Google made bgo inaccessible.

4. robots.txt is basic net behaviour. Its contents refer to crawling the site. That you make it about privacy is irrelevant.

That said, you didn't address the only objection that I have: chpe doesn't want it.
About number 2, I'd say go for it with the crawl rate configured as Google recommends. It was apparently four years ago that Google brought it down. Perhaps things would be better now? I'd say that's worth testing. Anyway, in 2004 there was new hardware that could maybe handle it, right?

About number 4, yeah, it's true that robots.txt is basic net behavior, but in a couple of places in this bug (e.g. comment 4 and comment 16) people have mentioned that they don't want the site indexed because of privacy and security reasons. That sounds like a problem, since the kinds of people that would be looking for passwords, email addresses and the like are the same people that would ignore a robots.txt file. To me, this isn't a good argument for robots.txt so much as a bad argument for why it's OK to put confidential information online.

About chpe's objections, those are a bit more tricky. He mentioned two points, but it sounds like the heart of the problem is bug-buddy, which certainly is a problem that should be addressed. If it's asking people for confidential information and posting that information to the Internet, that's a HUGE bug, and robots.txt is like a leaky valve on a fire hose. Comment 13 seems to have a good point that if this is the case, we should parse out the email addresses on the fly. I tried just now to see if this is filed as a bug... but the search tool failed me and returned a blank page.

To chpe's objection about being associated with p0rn and such in Google, I'd say that he's already associated with it in Bugzilla, and that if he really wanted not to be associated with something on the Internet, he shouldn't have _associated_ with it on the Internet. For comments where he is associated by triage or other comments, it shouldn't be too much of a concern to his reputation.
If he is doing something in Bugzilla that would tarnish his reputation, then I'd say he shouldn't have done those actions without donning the cloak of anonymity that the Internet provides (I'm guessing this isn't the case). So I don't have much of a solution there, obviously, and maybe I don't understand the problem completely. He does have more creds around here than I do, but being indexed by search engines is basic net behavior too, or so I thought. Hmmm... complicated. One other solution could be to give people time to change their login and personal information so they can anonymize it before any indexing happens.
I just have one more comment here, just to document how frustrating the search within bugzilla is. I just wanted to search for bugs having to do with evince and crash handling. In order to cover the bases, I had to search for "crash handle," "crash handler," "crash handling," and all down the line. It's frustrating as a new bug reporter that finding bugs is so challenging. The problem is that you want to behave well and not file a duplicate, but it's really hard to find duplicates in the first place.
I'm fully for this, and I think it would be tremendously useful to tons and tons of people who are searching for information about GNOME issues. We would only allow indexing show_bug.cgi, so basically Google would only see bugs linked from external parties. I'm pretty sure that the current Bugzilla could now handle being indexed by Google. If there are load problems, we can disable crawling fairly easily when they happen. I understand chpe's concern, but with some analysis, I don't think there is a serious personal privacy issue with allowing Google to index Bugzilla. Individual commenter names won't have a very high pagerank, I suspect, and once we add back in describeuser.cgi, the pagerank would go to the describeuser.cgi page anyhow, not any particular bug page. In any case, bugs get a high pagerank only if they're frequently linked from outside or from other bugs. Also, the excerpts that Google shows are the ones close to your name, which would be content you created yourself.
Anybody still investigating this? The only reasonable way I find bugs on bugzilla.gnome.org these days is by googling for a relevant bug on Launchpad or on bugzilla.redhat.com, and hoping somebody's tied it to the correct bug here, which is a pretty frustrating way to find the upstream report.
I see lots and lots and lots of discussions in forums everywhere about bugs that have long been reported and analyzed. But the users and forum regulars involved in those discussions seem to be unaware of the bug reports. And I guess that's because none of them turn up in the searches they do.
Regarding indexing per se, Bugzilla is partially indexed: https://www.google.com/search?q=site%3Abugzilla.gnome.org+gnome

However, those search results are not really useful unless you tell Google not to filter the "non-relevant" results, i.e. https://www.google.com/search?q=site%3Abugzilla.gnome.org+gnome&filter=0

Maybe it would make sense to help search engines show the really relevant results - links to bugs with their summary - instead. DuckDuckGo is a bit better, i.e. it doesn't filter that aggressively: https://duckduckgo.com/?q=site%3Abugzilla.gnome.org+gnome

If server load is still an issue: a few hours ago on IRC (#sysadmin), I suggested that a less invasive static mirror of Bugzilla for indexing purposes could IMO be maintained with very basic tools (a search that simply returns all bugs changed in the past hour, a cron job, and a wget limited to show_bug.cgi with recursive depth set to 1).

Regarding privacy: if any sensitive data is in publicly accessible bugs now, then that is the problem. It is not safe just because fewer people can access it - all the wrong people have it already.
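The static-mirror idea above can be sketched in a few lines of Python. This is an assumption-laden illustration: the list of changed bug IDs would come from the "changed in the past hour" search, which is not shown, and `bug_url`/`mirror_bugs` are hypothetical helper names.

```python
import urllib.request
from pathlib import Path

BASE = "https://bugzilla.gnome.org/show_bug.cgi?id="

def bug_url(bug_id):
    # Build the canonical show_bug.cgi URL for a bug.
    return BASE + str(bug_id)

def mirror_bugs(bug_ids, outdir="mirror"):
    """Fetch each changed bug page and store it as static HTML.
    A cron job would call this hourly with the IDs of bugs changed
    since the last run, giving crawlers a cheap static copy to index."""
    Path(outdir).mkdir(exist_ok=True)
    for bug_id in bug_ids:
        html = urllib.request.urlopen(bug_url(bug_id)).read()
        Path(outdir, "%d.html" % bug_id).write_bytes(html)
```

The mirror takes the crawler load entirely off the CGI scripts, at the cost of up to an hour of lag in the indexed copy.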
I have just allowed Googlebot access to /show_bug.cgi, and will be watching the load on our server over the coming days. This way I hope to get some information on whether or not our current setup is able to handle it, and whether the crawler hits our rate-limiting system. Please note that if I notice the load rising too high, I will disable this again, but we will at least have some data on it. Regarding privacy: as long as Google is not logged in (and it isn't), it will only see the names of the people commenting on the bug, but no email addresses.
My concerns (detailed in comment 16) aren't addressed by that at all. So why did you go ahead with this regardless?
I welcome this move. Your concerns can be addressed by marking comments as private. Also: We are (and were) quite clear about the fact that this is an *open* bug tracker and that "activity on most bugs, including email addresses, will be visible to the public" cf. https://bugzilla.gnome.org/createaccount.cgi If you have any further points to make, I guess upstream bugzilla is the more appropriate place.
Google is now indexing searches. I am not certain that is useful. Here's a sample I see: https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=6&ved=0CEwQFjAF&url=https%3A%2F%2Fbugzilla.gnome.org%2Fbuglist.cgi%3Fbug_status%3DUNCONFIRMED%26bug_status%3DNEW%26bug_status%3DASSIGNED%26bug_status%3DREOPENED%26priority%3DNormal%26product%3Dgtk%252B%26query_format%3Dadvanced%26order%3Dbug_severity%252Cbug_id%26query_based_on%3D&ei=HOZcU5y5IobnsAS8jYCACg&usg=AFQjCNE2Ib_I24TkZ8dSYD62LcQdRIiIAQ&sig2=WE-hnKa67Ykuw433AE9DuQ&cad=rja I don't see any evidence that the actual bug reports show up in google results.
Patrick: Any updates regarding comment 27 and comment 30?
After https://wiki.gnome.org/Initiatives/DevelopmentInfrastructure , GNOME is moving its task tracking from Bugzilla to GitLab at https://gitlab.gnome.org/ as previously announced in https://mail.gnome.org/archives/desktop-devel-list/2018-May/msg00026.html . See https://wiki.gnome.org/GitLab for more information. Hence closing this ticket as WONTFIX: There are no plans to work on Bugzilla.