GNOME Bugzilla – Bug 87927
Make script to find similar stack traces
Last modified: 2006-01-16 08:30:56 UTC
It would be good to have bugzilla automatically look for duplicate bugs by comparing stack traces. IMO: First, stack traces would have to be identified; this shouldn't be too hard. Next, the traces need to be identified in some way; two traces of the same bug aren't going to be word-for-word identical.
Right. So- thoughts [feedback from the rest of bugmaster@ very welcome.]
*Doing this correctly and automatically will be Hard. I'd like to start with a proof-of-concept CGI page, where people enter a bug number or a stack trace and get back a list of potential duplicates.
*do we want to special case ()?? -type traces?
*how do we decide whether or not to 'test' something for duplication? Just bug-buddy? bug-buddy + simple-bug-guide?
*Questions: if the algorithm gets 'sophisticated' enough to do 'probable dup' vs. 'really damn certain it is a dup' how do we want to deal with those separate cases?
Thoughts on very, very sketchy regexp+algorithm:
1) strip out all lines not beginning with #
2) strip all lines before <signal handler called> [this is not the only 'keyword' here; I need to search/recall what the other flavors of this are.]
3) maybe strip all #[*] 0x[*] leaving only function names? [Seems useful- maybe we can store the stripped down, function-name only version in the DB for quicker searching?][Maybe also strip everything /past/ the first ( ?]
4) so now we've got only function names. We pull the first one and search for it. [maybe we also strip/ignore really common top function names?]
5) If we get a hit on the first one, compare second/third function names. Match on all three is very likely a dup. [again, with exception like gdk_x_error stuff.]
6) If any of the matches are still open/unconfirmed, we display only those that are open, possibly 'weighting' for bugs with multiple duplicates already added. If no open matches are found, we list matches that were marked 'FIXED' first, then (maybe?) matches that were closed as RESOLVED.
Flaws in the algorithm, so far:
*completely ignores/munges multi-threaded stack traces.
*obviously not robust to traces that are off-by-one, or cases where very common functions are called near the top of a trace.
Other random thoughts: *we'll definitely have to special case some things, like gdk_x_error, that can occur across many apps. cc'ing Ben Liblit on the off chance he has any insight he might want to share; Ben, please feel free to ignore me :) Anyway, these ramblings are obviously incoherent/incomplete, but I thought I'd get them down for the record quickly. If someone can cook up a regexp for steps 1-4, I can whip up a test web page for steps 5 and 6 quickly. Otherwise it'll have to wait until I sit down with the Camel Book, which might be a few days.
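For the record, steps 1 and 2 of the sketchy algorithm above could look roughly like this. It is a minimal Python sketch, not the script itself; the name `candidate_frames` and the sample trace are made up for illustration, and it only knows about the `<signal handler called>` marker.

```python
def candidate_frames(report_lines):
    """Keep only '#'-prefixed lines, then drop everything up to and
    including the first '<signal handler called>' frame (steps 1 and 2)."""
    frames = [line.strip() for line in report_lines
              if line.lstrip().startswith("#")]
    for index, frame in enumerate(frames):
        if "<signal handler called>" in frame:
            return frames[index + 1:]
    # No marker found: fall back to every frame line.
    return frames


# Invented sample report, shaped like a bug-buddy trace.
report = [
    "Debugging information:",
    "#0  0x40a1b2c9 in wait4 () from /lib/libc.so.6",
    "#1  0x40a9c4d8 in libgnomeui_segv_handle () from /usr/lib/libgnomeui-2.so.0",
    "#2  <signal handler called>",
    "#3  0x0806f1a2 in nautilus_file_ref () from nautilus",
    "#4  0x0806f3b0 in fm_directory_view_load () from nautilus",
]
frames = candidate_frames(report)
```

With the sample above, `frames` holds only the `#3` and `#4` lines. Multi-threaded traces would still be munged, since only the first marker is considered.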
Not knowing the internals of bugzilla, I'm going to assume that we have the text of a bug stored in an array, @theBug.
  my @functions = ();
  my @files = ();
  foreach my $line (@theBug) {
      # Frame lines look like: #0  0x40123abc in some_func () from /lib/libfoo.so
      if ($line =~ /^\#(\d+)\s+0x[0-9a-fA-F]+\s+in\s+(.*\(\))\s+from\s+(.*)$/) {
          push(@functions, $2);  # function name, including the trailing "()"
          push(@files, $3);      # library or binary the frame came from
      }
  }
Given @functions from the above, and @functions from some other bug report, one could diff the two to find similar bugs.
The regular expression proposed by Ben FrantzDale will fail if GDB wraps its output. I've used the following pattern to good effect: /^\#\d+ 0x[0-9A-Fa-f]+ in (\w+) \(/ It only matches a prefix of each frame's description. The prefix is short enough to be unlikely to wrap, but long enough to capture the function name and be unlikely to match anything that *isn't* a stack frame.
killpg() is the other one I wanted to break/strip at, just to leave this as a note for myself.
FWIW, I'm using Ben Liblit's expression for right now, except with #\d+ +0x instead of #\d+ 0x since it was ignoring #0...#9 as they have two spaces after the \d. Work in progress (right now only working on perfecting the stripping) at http://bugzilla.gnome.org/simple-dup-finder.cgi Thanks a bunch for kicking me in the ass, Ben :)
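To make the one-space vs. two-space issue concrete, here is a small Python sketch comparing the original pattern with the relaxed `#\d+ +0x` form. The pattern is the one quoted above; the helper name `function_names`, the limit of five, and the sample frames are assumptions for illustration only.

```python
import re

# Pattern as first suggested: exactly one space after the frame number.
STRICT = re.compile(r"^#\d+ 0x[0-9A-Fa-f]+ in (\w+) \(")
# Relaxed form: one or more spaces, so #0..#9 (two spaces) also match.
RELAXED = re.compile(r"^#\d+ +0x[0-9A-Fa-f]+ in (\w+) \(")


def function_names(frame_lines, pattern=RELAXED, limit=5):
    """Return up to `limit` function names from matching frame lines."""
    names = []
    for line in frame_lines:
        match = pattern.match(line)
        if match:
            names.append(match.group(1))
        if len(names) == limit:
            break
    return names


# Invented frames: single-digit frame numbers are followed by two spaces.
frames = [
    "#0  0x40a1b2c9 in gtk_widget_event () from /usr/lib/libgtk.so",
    "#10 0x40a1c000 in g_main_dispatch () from /usr/lib/libglib.so",
]
strict_hits = function_names(frames, STRICT)
relaxed_hits = function_names(frames, RELAXED)
```

Here `strict_hits` contains only `g_main_dispatch`, while `relaxed_hits` contains both names.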
Eck. Bad, bad things. So, simple-dup-finder fairly robustly gets the last five functions in a stack. All well and good. Search for the key phrases, and you get a result. Depending on the exact magic SQL incantation, said result takes about 4 minutes :/ So... I'm going to work on the SQL, but I doubt I can make it much better. Options:
*have bug-buddy create/use other hidden fields to cache the results of the stack trace parsing. Advantage: very, very fast. Disad: makes upgrading more sucky :/
*Dump the data into the whiteboard instead of a custom set of fields? This makes things very fast, right now, when almost all whiteboards are empty. Disad: whiteboard is basically useless and off limits.
*Live with some very big query times, that would presumably be happening fairly often.
*queue the parsing, and do it on a daily basis from cron, at some hour where the fewest possible people will be inconvenienced.
*Someone with more SQL knowledge than me helps me speed up the query some other way. I'm afraid ATM it's pretty much as straightforward and simple as I can make it. [I'll commit it soon, in current form.]
As for how often things get run, I was thinking that it would be something run every day or so rather than at the time of creation of each bug. As for speeding up the searching, you could probably narrow the search space dramatically by searching for all bugs that (1) might have a stack trace in them (i.e., contain "#0") (2) match the right set of function names and/or match the right set of filenames. BTW: how do you use the duplicate finder you linked above?
I'm already searching on the function names; that's what takes so long :) Right now, all that does is generate a list of function names; I'm then plugging those into a comment search on query.cgi, which works reliably. I'm right now incorporating the query directly into that page, but it'll be slow :)
Oh! I realize why it didn't make sense to you :) Try this: http://bugzilla.gnome.org/simple-bug-finder?bug_id=86839
Argh. http://bugzilla.gnome.org/simple-dup-finder.cgi?bug_id=86839
Comment from chris lahey: 'might want to distinguish between no symbols and no bug found' Probably also should distinguish between no symbols in trace and no trace.
Jody suggests doing version-based checking as well.
Once this is working reasonably well, someone should do a quick script (maybe better done at a shell than over the web) to run this against every bug in the database. I'm imagining output like: bug xxxx has possible duplicates: yyyy zzzz ... You'd probably want to search in increasing order. Once a bug has been found as a possible duplicate of another bug, it probably shouldn't be searched against itself. With that output, people could go through and (hopefully) do some mass duplication marking.
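A rough Python sketch of such a batch pass, under heavy assumptions: `traces` is a hypothetical in-memory map from bug number to its extracted function names, and two bugs count as possible duplicates only when those names are identical. The real matching would be fuzzier.

```python
def report_possible_duplicates(traces):
    """Walk bugs in increasing order; a bug already listed as a possible
    duplicate of an earlier bug is not searched again."""
    already_listed = set()
    report = {}
    for bug_id in sorted(traces):
        if bug_id in already_listed:
            continue
        dups = [other for other in sorted(traces)
                if other > bug_id
                and other not in already_listed
                and traces[other] == traces[bug_id]]
        if dups:
            report[bug_id] = dups
            already_listed.update(dups)
    return report


# Hypothetical bug numbers and function names.
traces = {
    1001: ("foo_crash", "foo_dispatch"),
    1002: ("foo_crash", "foo_dispatch"),
    1003: ("bar_init",),
    1004: ("foo_crash", "foo_dispatch"),
}
for bug_id, dups in report_possible_duplicates(traces).items():
    print("bug %d has possible duplicates: %s"
          % (bug_id, " ".join(str(d) for d in dups)))
```

The pairwise comparison is quadratic in the number of bugs, which is fine for a sketch but would want an index on the function names for the full database.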
It breaks for bug 88015. That bug has what appears to be a normal bug-buddy--generated trace, but the script doesn't see it.
Here's a thought: If any of the duplicates found have known duplicates, include the duplicate too. This would be particularly useful for TRACKER bugs. TRACKER bugs don't have stack traces generally, but the person running a dupe search should be aware that there is a TRACKER for their bug.
The listing/tracking of duplicate numbers really can't be done in our current DB[1]; it becomes much saner in 2.16 so I'll definitely want to have that added when we upgrade. [1] believe it or not, there is no field that keeps tracks of duplicates in bugzilla pre-2.14. The only way you know X is a duplicate of Y is by parsing all the text comments, which makes the query really nasty. Looking at 88015 right now.
Ah. 88015 gets ignored because it wasn't called from libgnomeui handler- it's just a 'normal' back trace. I'm not entirely sure I want to handle that case- we won't be 'automatically' parsing those from bug-buddy anyway. At any rate... I'll look at some special case code in there, but given the (crappy) way I wrote the code in the first place, this'll be a little ugly. So it isn't a high priority. Oh, and BTW, about parsing the DB, I definitely want to do that, if for no other reason than to get some stats on things, figure out what I might be missing, etc., etc.
Here's a false positive, I think: http://bugzilla.gnome.org/simple-dup-finder.cgi?bug_id=83738 finds both bug 71509 and bug 47920. In general, that search looks like it finds a few different families of duplicates. Some for nautilus and some for the CD player, among others.
Yeah, 83738 has a lot of bogus junk at the top that needs to be filtered out; that's why it is catching all the bogus dups. Thanks for recording that example, though; I'll need to test on things like that.
http://bugzilla.ximian.com/simple-dup-finder.cgi?bug_id=26103 18428 is a bogus dup of 26103. So... I should look into making these an ordered regexp instead of a series of SUBSTRs. It would probably be faster to boot.
You should just be able to do "$f1.*$f2.*$f3.*$f4.*$f5", right? (where that's the perl substring of the sql query.)
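As an illustration of the ordered-match idea, here is a Python sketch that builds the same kind of pattern in application code rather than in SQL. The helper name `ordered_pattern` and the sample comment are invented; function names are escaped so they are matched literally.

```python
import re


def ordered_pattern(functions):
    """Build a regexp requiring the names to appear in the given order,
    with anything (including newlines) in between."""
    return re.compile(".*".join(re.escape(name) for name in functions),
                      re.DOTALL)


# Invented comment text containing two frames.
comment = ("#3  0x40a1c000 in g_main_dispatch () from libglib.so\n"
           "#4  0x40a1c100 in g_main_iterate () from libglib.so\n")

in_order = ordered_pattern(["g_main_dispatch", "g_main_iterate"])
reversed_order = ordered_pattern(["g_main_iterate", "g_main_dispatch"])
found = in_order.search(comment) is not None
found_reversed = reversed_order.search(comment) is not None
```

`found` is true and `found_reversed` is false, which is the difference from a series of independent substring tests.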
Ben: that's basically what I did last night before dinner; turns out mysql regexp is /abysmally/ slow. :/ So that's not going to be a usable solution.
http://bugzilla.gnome.org/show_bug.cgi?id=88137 Has a stack trace, but gets the "no stack trace" error.
John: that works here; did you mistype the bug number (either into the bug report or into the dup finder page?)
It would be good to check attachments for stack traces as in bug 88362. At least this could be done for the bug we are checking against, but ideally it would be done for the bugs that are searched.
Ben FrantzDale: any particular reason you removed me from the Cc: list?
Oh, hrm... I should have checked that... I assumed you did, Ben L. I assume it was just an error on Ben F.'s part? FWIW, re: checking attachments: I'm really unlikely to do that; this is going to mainly be for bug-buddy, first off, and secondly, checking/querying attachments requires loading them into memory and doing parsing of them- it can't be done directly in SQL. It would just be very irritating to implement and not at all worth the slowdown and waste involved.
Ben L: That's odd. I certainly didn't mean to remove you from the CC list. Looking at my emails, I can't even find the update when it happened. As for attachments, If they are stored on disk as files rather than in the DB, then yea, it wouldn't be worth it.
The title of the search page should include the bug number we searched for. (Personally, I'd prefer if the number came as early as possible in the title so as to fit in my galeon tabs easily. Perhaps "12345: possible dups" would be clear enough? If not, no matter.)
If you search for dups of a bug, sometimes you get 99,000 "RESOLVED" bugs that are all (or mostly) duplicates of a single one. It would be nice if the page was aware of this and highlighted the results like this:
UNCO The_bug_I_searched_for
UNCO }
UNCO } useful duplicates :-)
UNCO }
RESO Yada yada yada (400 duplicates found!)
RESO Blada blada blada (5 duplicates found!)
It doesn't find the stack trace in bug 84528.
It doesn't find the stack trace in bug 88931
Lots of false gtk crap shows 88077 as a dup of 57250. Maybe all the gtk_* and g stuff needs to be filtered.
As part of the 'confidence' score it should do a component/product check.
You should put a "Search for duplicates" link somewhere in the bug page
while you're at it, the dup finder should allow you to enter an arbitrary stack trace and search for existing dups.
False positive: bug 89638.
Broken: http://bugzilla.gnome.org/simple-dup-finder.cgi?bug_id=89744 It finds these functions:
1. PL_HandleEvent
2. PL_ProcessPendingEvents
3. event_processor_callback
4. our_gdk_io_invoke
5. g_io_unix_dispatch
yet that's starting at line #38.
Bug 86746 confuses this because its first function appears to be named ".div".
This doesn't see the trace in bug 87710. It appears to have "sigsuspend".
http://bugzilla.gnome.org/show_bug.cgi?id=59699 No symbols found.
change ?bug_id= to ?id= for consistency with show_bug.cgi. that would make it easier to switch between a bug and searching for a dup of it by replacing show_bug with simple-dup-finder
No stack symbols were found. Bug #87927
Sorry: I meant bug #87894 :)
Ok. Really sorry : I forgot bug_id=.
http://bugzilla.gnome.org/simple-dup-finder.cgi?bug_id=65516 No stack symbols were found in bug 65516.
Doesn't find the trace in bug 91819
Doesn't find the trace in bug 91822
*I've fixed bug 88015 and family (we search for '(gdb) bt')
*I've fixed bug 88137 and family ('Backtrace was generated from %')
*not sure how to handle bug 84528; I guess I'll add in 'Debugging Information'.
*88931 has nothing other than 'lots of pound signs'; not sure how to handle that. Of course, maybe that's the best solution- shouldn't be that time consuming to just do the regexp on all fields in the bug and see what comes back.
*I've resolved dave's request (IRL) to use id= instead of bug_id=
*89744 seems to be broken because of the :: in the functions it ignores. Ben, you wrote the regexp; think you can take a look at why it is ignoring those?
*all the rest should be caught by earlier fixes in this list.
I'll probably open up a successor bug to deal with remaining issues.
Assuming the regexp being used now is similar to the one I originally suggested, then "::" can be allowed by finding the part of the regexp that looks like this: (\w+) and changing just that one part to this instead: ((\w|::)+) Note the additional set of parens, which might require adjustments to $2, $3, etc. depending on how the regular expression is being used. (I don't know where the script lives, so I can't check that myself.)
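A quick Python check of that change, including the extra capture group it introduces. The frame line is an invented example built around a C++ method name; the surrounding pattern is assumed to match the relaxed form discussed earlier in this bug.

```python
import re

PLAIN = re.compile(r"^#\d+ +0x[0-9A-Fa-f]+ in (\w+) \(")
SCOPED = re.compile(r"^#\d+ +0x[0-9A-Fa-f]+ in ((\w|::)+) \(")

# Invented frame with a C++ scoped name.
frame = "#3  0x40e5a1c0 in GtkPromptService::Confirm (this=0x8a3f2b0)"

plain_match = PLAIN.match(frame)    # \w+ stops at '::', so no match
scoped_match = SCOPED.match(frame)
full_name = scoped_match.group(1)   # the whole scoped name
inner_group = scoped_match.group(2) # last repetition of the inner group
```

`plain_match` is `None`, `full_name` is `GtkPromptService::Confirm`, and `inner_group` is just the final character `m`, which is why code that reads later groups by number may need adjusting. A non-capturing group, `(?:\w|::)+`, would avoid the renumbering.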
Wow, thanks, Ben, you rule :) That works like a charm. I'm trying out the more intelligent approach to seeing if a bug has a trace, but so far, it only works on some cases; I think I'm doing something wrong with the perl.
Ah, figured it out; there was a second failure point I wasn't thinking about. I've caught that now. So, basically, everything with a stack trace, except attachments, should now be caught in some form or another. There is still the problem of bogus information in those traces, of course. I'll probably poke at that for a few more hours before heading home.
87710 still badly false-positives, as does 91822 to a lesser extent. Everything else 'works' in the sense that we get reasonable meaningful function names from them. Big thanks to all the people who helped collect these examples, and who kept doing so after I'd neglected the code for a month :) So, the next step is robustness of the algorithm used to find and identify duplicates. The current situation (where zillions of 'resolved' things can clutter the list) is... icky. Ben's first proposed partial solution doesn't actually work with the current DB (though it would with 2.16). I need to stare at the code and experiment a bit, I think; I was going to try to write out a plan but my brain is fried.
*** This bug has been marked as a duplicate of 95490 ***
Sigh. First change I make in bugzilla in like three weeks and it's /wrong/.
http://bugzilla.gnome.org/simple-dup-finder.cgi?id=94495 This shows 94495 twice.
96177 appears to have a stack trace but simple_dup_finder doesn't catch any.
No trace found in bug 60406
http://bugzilla.gnome.org/simple-dup-finder.cgi?id=61235 produces lots of hits. The trace reads as: 1. gtk_widget_event 2. gtk_main_do_event 3. gdk_event_dispatch 4. g_main_dispatch 5. g_main_iterate Which isn't very helpful or unique.
http://bugzilla.gnome.org/simple-dup-finder.cgi?id=95402 results in 97086, 94226, 95402
http://bugzilla.gnome.org/simple-dup-finder.cgi?id=97086 results in 97086, 94226 missing 95402.
http://bugzilla.gnome.org/simple-dup-finder.cgi?id=94226 results in 97086, 94226, 95402, 93830
http://bugzilla.gnome.org/simple-dup-finder.cgi?id=93830 results in hundreds of other random bugs, thus suggesting the function calls are garbage.
The function calls are different, as well:
95402:
1. uri_matches_as_parent
2. gnome_vfs_uri_is_parent
3. nautilus_file_operations_copy_move
4. icon_view_handle_uri_list
5. nautilus_marshal_VOID__POINTER_INT_INT_INT
97086:
1. uri_matches_as_parent
2. gnome_vfs_uri_is_parent
3. fm_directory_view_move_copy_items
4. icon_view_handle_uri_list
5. nautilus_marshal_VOID__POINTER_INT_INT_INT
94226:
1. gnome_vfs_uri_is_parent
2. icon_view_handle_uri_list
3. nautilus_marshal_VOID__POINTER_INT_INT_INT
4. g_closure_invoke
5. signal_emit_unlocked_R
93830:
1. __pthread_wait_for_restart_signal
2. pthread_cond_wait
3. poll
4. __pthread_manager
5. wait4
So, very confusing. These four all look to be duplicates by the summaries/stack traces, but simple-dup-finder doesn't think so (some of the time). :)
Now that I come to think of it, there are a few situations where s-d-f works one way but not the other.
:) Yeah, and that's no good. A being a duplicate of B definitely implies B is a duplicate of A.
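One way to get a comparison that cannot disagree with itself is to score the overlap of the two function sets instead of searching for one bug's names inside the other's comments. This Python sketch uses Jaccard similarity, which is a swapped-in technique and not what s-d-f does; the function lists are the ones reported above for 95402 and 97086.

```python
def similarity(functions_a, functions_b):
    """Jaccard similarity of two function-name lists: shared names divided
    by all distinct names. Symmetric by construction."""
    set_a, set_b = set(functions_a), set(functions_b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


bug_95402 = [
    "uri_matches_as_parent",
    "gnome_vfs_uri_is_parent",
    "nautilus_file_operations_copy_move",
    "icon_view_handle_uri_list",
    "nautilus_marshal_VOID__POINTER_INT_INT_INT",
]
bug_97086 = [
    "uri_matches_as_parent",
    "gnome_vfs_uri_is_parent",
    "fm_directory_view_move_copy_items",
    "icon_view_handle_uri_list",
    "nautilus_marshal_VOID__POINTER_INT_INT_INT",
]
forward = similarity(bug_95402, bug_97086)
backward = similarity(bug_97086, bug_95402)
```

Both directions give the same score (four shared names out of six distinct ones). Where to put the threshold for "possible duplicate" is a separate question.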
A text box. Wouldn't that be nice. To have a text box to paste a stack trace into to do a quick dup-check without first having to file a bug then run s-d-f on it. It also means you can point people towards it on #gnome if they ask if their bug's already filed - in fact, someone just asked that ;-)
http://bugzilla.gnome.org/simple-dup-finder.cgi?id=97358 shows 73963 twice in the list.
http://bugzilla.gnome.org/simple-dup-finder.cgi?id=97526 does not report bug 97165.
http://bugzilla.gnome.org/simple-dup-finder.cgi?id=98794 lists bug 97883 twice (presumably because it has 2 stack traces)
http://bugzilla.gnome.org/simple-dup-finder.cgi?id=100291 results in a massive number of duplicates (if no good function names are found, perhaps we shouldn't search?)
http://bugzilla.gnome.org/simple-dup-finder.cgi?id=102245 The stack frame starting gnome_window_manager_get_settings is ignored
This should at least hit itself: bug 10191.
http://bugzilla.gnome.org/simple-dup-finder.cgi?id=108417 Bug 106166 is found twice.
http://bugzilla.gnome.org/simple-dup-finder.cgi?id=120771 The first 2 stack frames are ignored: GtkPromptService::GetGtkWindowForDOMWindow(nsIDOMWindow*) GtkPromptService::Confirm(nsIDOMWindow*, unsigned short const*, unsigned short const*, int*)
Doesn't see any frames in bug 121719
No stack symbols were found in bug 131243.
Apologies if it's been already mentioned here, have not read it closely. :) When we look for dups, would it be possible to get only the bug which has been marked Resolved and Fixed (i.e., if one exists) instead of listing all the bugs? Or we could have "Bugid" "Status" "Resolution" for all the bugs. At least that would help in finding the right one faster. :)
Well, some bugs accumulate a lot of dups when they are still open. In 2.16 (cough) I believe it is quite trivial to sort by the number of duplicates. (Yippee.) Of course, we are not quite running 2.16 yet :)
http://bugzilla.gnome.org/simple-dup-finder.cgi?id=121734 <- this really sucks, we should fix it. [Whenever I quit my job to become the bugzilla guy again ;)]
http://bugzilla.gnome.org/simple-dup-finder.cgi?id=125523 also broken
http://bugzilla.gnome.org/simple-dup-finder.cgi?id=135288 This misses the first function in the trace (nautilus_desktop_link_get_link_type), possibly because it has no memory address before it?
Bug 135416 should find 128424, not sure why it doesn't.
Bug 169409 does not find itself. http://bugzilla.gnome.org/dupfinder/simple-dup-finder.cgi?id=169409
That's because it hits the limit of 100 I added. For some bugs it would show so many possible duplicates (useless stack trace) that it would take several minutes to show the report, so I artificially cut it off at 100. We should add a comment about that...
bug 300983 doesn't find itself
Kjartan: That is because simple-dup-finder has been limited to 100 results maximum (see comment 81). I added a warning when it returns 100 results.
Bug #303466 does not list itself
It appears the problem is that the simple dup finder is trying to get functions from two separate stack traces for 303466...interesting. Too bad I don't have much time to look further at the moment--someone please ping me in two to three weeks if no one else has taken a look.
Comment 84: Cause is simple-dup-finder taking the functions from multiple comments and matching them per comment. It should take them from the comment with the best stacktrace and not multiple ones. Easy fix is to limit it to the first comment with functions in it. The SQL already has a regex that selects those. Adding a 'limit 1' would fix it. I think I'll add a "ORDER BY bug_when DESC LIMIT 1". That will fetch the functions from the newest stacktrace comment (that should be the best one... providing the newest comment always has the best stacktrace). grmbl.. Elijah is too fast
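The effect of that clause can be tried out with a small self-contained Python sketch against an in-memory SQLite database. Everything here is an assumption for illustration: the table and column names follow the usual Bugzilla `longdescs` layout, the rows are invented, and a plain `LIKE` test for `#0 ` stands in for the regex the real query uses.

```python
import sqlite3


def newest_trace_comment(conn, bug_id):
    """Return the text of the newest comment that looks like it has a trace."""
    row = conn.execute(
        "SELECT thetext FROM longdescs "
        "WHERE bug_id = ? AND thetext LIKE '%#0 %' "
        "ORDER BY bug_when DESC LIMIT 1",
        (bug_id,),
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE longdescs (bug_id INTEGER, bug_when TEXT, thetext TEXT)")
conn.executemany(
    "INSERT INTO longdescs VALUES (?, ?, ?)",
    [
        (303466, "2005-05-08 10:00:00", "#0  0x1 in old_func ()"),
        (303466, "2005-05-09 10:00:00", "#0  0x2 in new_func ()"),
        (303466, "2005-05-10 10:00:00", "Thanks, confirmed."),
    ],
)
newest = newest_trace_comment(conn, 303466)
```

`newest` is the `new_func` comment: the latest comment is skipped because it has no trace, and the older trace loses to the newer one.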
Made the change, bug #303466 finds itself again.
The function calls from bug 306582 are not extracted properly. Every odd function name is emitted. http://bugzilla.gnome.org/dupfinder/simple-dup-finder.cgi?id=306582
One of the very familiar dups of Dia does not get found due to the generated IA__ prefix. Removing that manually from the stack trace makes it work again. Maybe that prefix should be stripped by the dup-finder script? Dup: bug #308678 Orig: bug #161603
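Stripping the prefix before searching is a one-liner; a Python sketch follows. The helper name `strip_alias_prefix` and the sample function names are invented for illustration.

```python
def strip_alias_prefix(name, prefix="IA__"):
    """Drop a leading IA__ so aliased and plain symbol names compare equal."""
    if name.startswith(prefix):
        return name[len(prefix):]
    return name


# Invented examples: one aliased symbol, one plain.
names = ["IA__gtk_widget_show", "dia_object_draw"]
normalized = [strip_alias_prefix(name) for name in names]
```

Applying the same normalization to both the searched trace and the stored text keeps the comparison consistent.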
Yeah, it really should. I tend to use boogle to manually do that right now, and was the main reason for adding the link to the boogle search on the simple-dup-finder page (though I use it for other things as well...)
Consider bug #169193. The trace contains missing symbols in its first five frames (see Trace 61799).
I believe simple-dup-finder should a) warn that we have missing symbols within the first 5 frames b) abort collecting function names as soon as it hits missing symbols, i.e. only extract four function names from the above trace. What it currently does is pick up five function names, regardless of unknown function names it encounters on its way.
Suggestion: Put the warning "Warning: Number of bugs has been limited to 100." above the duplicate listing.
It's not too infrequent that we get crap stack traces and get the same ones over and over--and the dupfinder helpfully points out the potential duplicates. There are a number of traces like this where if we only extract functions from the beginning of the trace then we won't get any and won't be notified about duplicates. But I put the boogle link inside the simple-dup-finder output for exactly this reason (though it also allows refining the search in other ways too). Putting the warning at the top shouldn't be real hard, though it means slurping in the outputs of the SQL query into memory, counting them, then doing output, instead of outputting things on the fly and counting them as we go and then displaying a warning if the count happens to be 100. If you'd like to look into fixing this, just look at bugzilla-new/dupfinder/simple-dup-finder.cgi and bugzilla-new/dupfinder/find-traces.pl. Neither is very long.
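The "slurp, count, then output" variant is short. Here is a Python sketch of the idea, not the CGI code: `render_results` is a made-up name, the bug numbers are invented, and the result rows are assumed to be already fetched into a list.

```python
def render_results(bug_ids, limit=100):
    """Build the output lines with the limit warning first, which requires
    knowing the count before anything is printed."""
    shown = bug_ids[:limit]
    lines = []
    if len(bug_ids) >= limit:
        lines.append("Warning: Number of bugs has been limited to %d." % limit)
    lines.extend("bug %d" % bug_id for bug_id in shown)
    return lines


many = render_results(list(range(1000, 1150)))  # 150 invented hits
few = render_results([1000, 1001, 1002])
```

`many` starts with the warning followed by 100 bug lines; `few` has no warning. Memory use is bounded by the limit as long as the query itself carries the same cap.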
Agreed - we get many useless traces. I still suggest missing symbols should not be ignored but taken into account, i.e. it should extract "??" as function names. This way it is still possible to detect duplicates of useless traces. Additionally, if time and manpower permit, we could extend s-d-f to scan for duplicates in a smart way so that any function passes the match test if the template trace has a missing symbol. For instance, for the example in Comment #91 s-d-f should IMHO extract:
1 gtk_window_move
2 terminal_screen_get_text_selected
3 terminal_screen_get_text_selected
4 gtk_list_store_remove
5 ??
When scanning for duplicates, any function name should pass the matching test for the missing symbol. This way crap traces _and_ eventually non-missing-symbols traces will be detected.
Your suggestion definitely has merit and I can see that it'd be useful in many cases, but it's also an example where we are just trading off which cases s-d-f is most useful in. With your scheme, if one stack trace of a bug is missing some symbols and another isn't then you can't detect they are duplicates with s-d-f (sometimes you may not be able to anyway, but I've found on many occasions that I can). Also, I have found dupes with s-d-f where your scheme would just extract 5 ??'s, which isn't useful (yes, such stack traces usually aren't at all trustworthy but if there are enough functions then s-d-f can sometimes give a small number of bugs to check and I can verify that they're dups by quickly reading the descriptions of each). It's hard to tell which choice will provide the best productivity with the tool. Both are still possible regardless of the choice because of the boogle link, and my basic feeling right now is that it's easier to delete function names you don't want to search on from a boogle query than to try and add them.
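A Python sketch of the wildcard idea from this comment, where `??` in the template trace matches any function at the same position. The name `frames_match` is made up, the template is the five-name list given in this comment, and the candidate traces are invented.

```python
def frames_match(template, candidate, wildcard="??"):
    """Position-by-position comparison; a wildcard entry in the template
    accepts any function name in the candidate."""
    if len(template) != len(candidate):
        return False
    return all(t == wildcard or t == c
               for t, c in zip(template, candidate))


template = [
    "gtk_window_move",
    "terminal_screen_get_text_selected",
    "terminal_screen_get_text_selected",
    "gtk_list_store_remove",
    "??",
]
# Invented candidates: same first four frames, different fifth frame.
with_symbol = template[:4] + ["g_main_dispatch"]
different = ["gtk_window_resize"] + template[1:4] + ["g_main_dispatch"]

matches_symbol = frames_match(template, with_symbol)
matches_different = frames_match(template, different)
```

`matches_symbol` is true and `matches_different` is false. Note the wildcard only applies on the template side, so swapping the arguments can change the answer.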
I can stick with boogle for the time being.
bug #317935 : function names get collected across threads. Is that intended?
We should blacklist the function: libgnomeui_segv_handle possibly others as well
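A Python sketch of such a blacklist, applied after extraction and before searching. Only `libgnomeui_segv_handle` comes from this comment; the helper name and the other sample names are invented, and which further entries belong on the list is left open.

```python
# Names that say nothing about the actual crash site.
IGNORED_FUNCTIONS = {"libgnomeui_segv_handle"}


def drop_ignored(names, ignored=IGNORED_FUNCTIONS):
    """Remove blacklisted names while keeping the remaining order."""
    return [name for name in names if name not in ignored]


extracted = ["libgnomeui_segv_handle", "foo_crash", "foo_dispatch"]
searchable = drop_ignored(extracted)
```

Filtering before truncating to the first five names means a blacklisted frame does not use up one of the five slots.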
Consider http://bugzilla.gnome.org/show_bug.cgi?id=324448 two frames bear the same name, but s-d-f extracts e_cal_backend_http_get_type only once...
Yeah, simple-dup-finder is not smart, it just lists bugs which have all these 'words' in one comment. It avoids the same function twice on purpose.
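In other words, the behaviour described here is roughly the following, shown as a Python sketch rather than the real SQL. The helper names are made up, the comment text is invented, and the function name is the one from the bug mentioned above.

```python
def unique_functions(names):
    """Keep each function name once, in first-seen order."""
    seen = set()
    unique = []
    for name in names:
        if name not in seen:
            seen.add(name)
            unique.append(name)
    return unique


def comment_matches(comment_text, functions):
    """True when every distinct function name occurs somewhere in the
    comment, in any order and any number of times."""
    return all(name in comment_text for name in unique_functions(functions))


functions = ["e_cal_backend_http_get_type", "e_cal_backend_http_get_type",
             "g_main_dispatch"]
# Invented comment text.
comment = ("#4 0x1 in e_cal_backend_http_get_type ()\n"
           "#5 0x2 in g_main_dispatch ()\n")
hit = comment_matches(comment, functions)
```

Since the test is plain substring containment, a frame that appears twice in one trace and once in another makes no difference to the result.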
bug 326345 got a stacktrace, but simple-dup-finder claims that there is none.
(In reply to comment #101)
> bug 326345 got a stacktrace, but simple-dup-finder claims that there is none.
Simple-dup-finder tries to get the newest stacktrace using SQL. After that it uses perl to actually parse the comment. Because the user quoted the *entire* description (aargh!!), the SQL sees the stacktrace but the perl code doesn't accept it (because of the '> ' at the beginning of the lines). Have to enhance the SQL to find the correct one. Plus make the reply option not quote anything if the comment is over a certain length (grr).
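A different way around the same problem, noted only as an alternative to the SQL change described here: strip reply-quote markers before the frame regexp runs. This is a Python sketch; the helper name `unquote` and the sample lines are invented.

```python
import re

# One or more leading '>' markers, each optionally followed by whitespace.
QUOTE_PREFIX = re.compile(r"^(?:>\s?)+")


def unquote(line):
    """Remove reply-quote markers so quoted frames parse like plain ones."""
    return QUOTE_PREFIX.sub("", line)


quoted = [
    "> #0  0x1 in foo ()",
    "> > #1  0x2 in bar ()",
    "#2  0x3 in baz ()",
]
plain = [unquote(line) for line in quoted]
```

After this, all three lines start with `#` again and the two spaces after single-digit frame numbers are preserved. The drawback is that a quoted, older trace could then be parsed instead of the reporter's original one.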
*** Bug 327034 has been marked as a duplicate of this bug. ***
Should we break this bug up? It has outlived the original purpose, probably...
No Simple Dup Finder available for bug 326478, although it has a stacktrace.
+1 from me for closing this bug and just opening new ones for any new (or remaining) issues. Karsten: That's because bugzilla currently only shows helpful stuff (triage links, simple-dup-finder link) when bugs are unconfirmed; it's a separate issue from this bug anyway (since the simple-dup-finder does find the stack trace in that bug).
Closing. Created simple-dup-finder component. Changed the report to mention a new bugreport has to be created.