GNOME Bugzilla – Bug 82751
wrong grouping/supressed articles on multipart messages
Last modified: 2006-06-18 05:14:56 UTC
This one is best described by example. After downloading from alt.binaries.movies.shadowrealm the article pane shows eg [green bitmap] |-some text (1of2).part01.P01 (1/45) 222361 [red bitmap ] |-some text (2of2).part01.P01 (1/45) 5002 That's wrong, because all 90 articles are available on the server, as I found out with another reader. As a hint : 0.11.92 displayed the same group on the same server as [green bitmap] |-some text (1of2).part01.P01 (1/45) (89) 222361 [red bitmap ] |-some text (2of2).part01.P01 (1/45) 5002 Regards Roman
I've traced this one down to the function normalize_subject in base/article-thread.c . When stripping "multipart-strings" from from the subject, the condition reads : if (multipart && (*in=='('||*in=='[') && isdigit(in[1]) which is a little bit too weak, as it also strips the "(1of2)" string in the example above. But that string is the only one, which distinguishes the two different posts. I've changed that for a more concise test to if (multipart && (*in=='('||*in=='[') && is_multipart_string(in)) where the new function reads : static gboolean is_multipart_string (const guchar *p) { if (*p != '(' && *p != '[') return FALSE ; if (!isdigit(*++p)) return FALSE; while(isdigit(*p)) p++ ; if (!ispunct(*p++)) return FALSE ; if (!isdigit(*p)) return FALSE; while(isdigit(*p)) p++ ; if (*p != ')' && *p != ']') return FALSE ; return TRUE; } Probably a regex would be more readable, but my general overview of the code is not sufficient enough (e.g. where to put the initial regcomp of the string ). Sincerely Roman
Why was this marked fixed?
No idea. Re-opening.
I think Roman's suggestion is a good one, but I'm not too keen on the idea of putting is_multipart_string() into Pan. normalize is one of Pan's biggest bottlenecks already.
I think we could fix this by doing an ispunct() check while we walk through the "(dd/dd)" string, and if we hit it, back up and don't skip.
How does this look? /* strip multipart information of the form (xx/yy). * when walking through the multipart, watch for deviations * from this format so that we don't lose strings like * (01of39). (#82751) */ if (multipart && (*in=='('||*in=='[') && isdigit(in[1])) { const guchar * start = in; while (*in && *in!=']' && *in!=')') { if (ispunct(*in)) { in = start; break; } ++in; } continue; }
Sorry, the longer I stare on this code, the more I'm convinced of staring on a deadloop. Also, the ispunct test doesn't seem correct to me. Condition should read : if we meet something, which is not a digit or a punctuation, then keep it. So I would write if (multipart && (*in=='('||*in=='[') && isdigit(in[1])) { const guchar * start = in; while (*in && *in!=']' && *in!=')') { if (isalpha(*in)) { in = ++start; // now, the opening '(' or '[' is lost, but // those would be filtered later on anyway. break; } ++in; } continue; }
I don't think that keeping all but digits would fix the problem: the normalized form of each top-level article would be "some text (of).part01.P01", which isn't enough information to let the threader know which top-level article is the parent of a subpart. Yes, you're right, there's a deadloop there, and ispunct() was wrong. How about: while (*in) { /* strip multipart information */ if (multipart && (*in=='('||*in=='[') && isdigit(in[1])) { const guchar * start = in; while (*in) { if (!isdigit(*in) && *in!='/' && *in!='|') { /* oops, this isn't multipart information */ *out++ = *start; in = start + 1; break; } if (*in!=']' && *in!=')') ++in; } continue; } /* strip out junk that breaks sorting */ if (_keep_chars[*in]) *out++ = tolower(*in); ++in; }
Keeping all but digits ? I do not do that. Normalized form will not be "some text (of).part01.P01" It will be "some text 01of03part01p01". ( not verified, only reading code. Will verify tomorrow ) Parentheses are not in _keep_chars. The line to build _keep_chars reads _keep_chars[i] = isalnum(uch) || isdigit(uch) || isspace(uch) No parenthesis (or dots) included. Btw, the isdigit test is redundant, Already included in isal_num_. I'm a little bemazed. Are we talking about the same code basis ? To your proposel : No need to worry about parenthesis, or first matched sign ( It's gonna be stripped anyway). You say, it's a multipart string, if its only contains digits or /|. I say, it's not a multipart string, if there is a letter. Thats pretty much the same. A little more general with my proposal, because e.g. (10.20) would be declared as multipart. Sincerely Roman
> I'm a little bemazed. Are we talking about the same code basis ? The error is on my end. I had a handful of things going on here and didn't give enough time to reading your patch. I agree with your rationale & have tested your patch; it seems fine. Just to make sure we're all on the same page, here is the patch: http://cvs.gnome.org/bonsai/cvsview2.cgi?diff_mode=context&whitespace_mode=show&subdir=pan/pan/base&command=DIFF_FRAMESET&file=article-thread.c&rev1=1.30&rev2=1.31&root=/cvs/gnome and here is the result of the test: ** Message: in: [|-some text (1of2).part01.P01 (1/45)] ** Message: out: [some text 1of2part01p01 ] Thanks for your patience. ;)