Bug 82751 – wrong grouping/supressed articles on multipart messages

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 82751 - wrong grouping/supressed articles on multipart messages


Summary:	wrong grouping/supressed articles on multipart messages


Status:	RESOLVED FIXED

Product:	Pan
Classification:	Other
Component:	general
Version:	pre-0.12.0 betas
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	0.12.2
Assigned To:	Charles Kerr
QA Contact:	Charles Kerr

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2002-05-23 13:13 UTC by Roman Möller
Modified:	2006-06-18 05:14 UTC

See Also:
GNOME target:	---
GNOME version:	---

Description Roman Möller 2002-05-23 13:13:24 UTC

This one is best described by example. After downloading from
alt.binaries.movies.shadowrealm the article pane shows eg

[green bitmap] |-some text (1of2).part01.P01 (1/45)              222361
[red bitmap  ] |-some text (2of2).part01.P01 (1/45)                5002

That's wrong, because all 90 articles are available on the server, as I
found out with another reader. As a hint : 0.11.92 displayed the same group
on the same server as

[green bitmap] |-some text (1of2).part01.P01 (1/45) (89)         222361
[red bitmap  ] |-some text (2of2).part01.P01 (1/45)                5002

Regards
Roman

Comment 1 Roman Möller 2002-07-01 11:48:33 UTC

I've traced this one down to the function normalize_subject in
base/article-thread.c .
When stripping "multipart-strings" from from the subject, the
condition reads :

if (multipart && (*in=='('||*in=='[') && isdigit(in[1])

which is a little bit too weak, as it also strips the "(1of2)" string
in the example above. But that string is the only one, which
distinguishes the two different posts.

I've changed that for a more concise test to

if (multipart && (*in=='('||*in=='[') && is_multipart_string(in))

where the new function reads :

static gboolean
is_multipart_string (const guchar *p)
{
	if (*p != '(' && *p != '[') return FALSE ;
	if (!isdigit(*++p)) return FALSE; 
	while(isdigit(*p)) p++ ;
	if (!ispunct(*p++)) return FALSE ;
	if (!isdigit(*p)) return FALSE; 
	while(isdigit(*p)) p++ ;
	if (*p != ')' && *p != ']') return FALSE ;
	return TRUE;
}

Probably a regex would be more readable, but my general overview of
the code is not sufficient enough (e.g. where to put the initial
regcomp of the string ).
Sincerely
Roman

Comment 2 Charles Kerr 2002-07-02 21:19:39 UTC

Why was this marked fixed?

Comment 3 Christophe Lambin 2002-07-02 23:22:20 UTC

No idea. Re-opening.

Comment 4 Charles Kerr 2002-07-03 06:01:53 UTC

I think Roman's suggestion is a good one, but I'm not
too keen on the idea of putting is_multipart_string()
into Pan.  normalize is one of Pan's biggest bottlenecks
already.

Comment 5 Charles Kerr 2002-07-23 05:44:14 UTC

I think we could fix this by doing an ispunct() check while
we walk through the "(dd/dd)" string, and if we hit it,
back up and don't skip.

Comment 6 Charles Kerr 2002-07-23 18:02:38 UTC

How does this look?

/* strip multipart information of the form (xx/yy).
 * when walking through the multipart, watch for deviations
 * from this format so that we don't lose strings like
 * (01of39).  (#82751) */
if (multipart && (*in=='('||*in=='[') && isdigit(in[1])) {
   const guchar * start = in;
   while (*in && *in!=']' && *in!=')') {
      if (ispunct(*in)) {
         in = start;
         break;
      } 
      ++in;
   }
   continue;
}

Comment 7 Roman Möller 2002-07-24 14:04:29 UTC

Sorry, the longer I stare on this code, the more I'm convinced of
staring on a deadloop. Also, the ispunct test doesn't seem correct to
me. Condition should read : if we meet something, which is not a digit
or a punctuation, then keep it.
So I would write 

if (multipart && (*in=='('||*in=='[') && isdigit(in[1])) {
   const guchar * start = in;
   while (*in && *in!=']' && *in!=')') {
      if (isalpha(*in)) {
         in = ++start; // now, the opening '(' or '[' is lost, but
                       // those would be filtered later on anyway.
         break;
      } 
      ++in;
   }
   continue;
}

Comment 8 Charles Kerr 2002-07-24 16:19:15 UTC

I don't think that keeping all but digits would fix the
problem: the normalized form of each top-level article
would be "some text (of).part01.P01", which isn't enough
information to let the threader know which top-level
article is the parent of a subpart.

Yes, you're right, there's a deadloop there, and ispunct()
was wrong.  How about:


while (*in)
{
   /* strip multipart information */
   if (multipart && (*in=='('||*in=='[') && isdigit(in[1])) {
      const guchar * start = in;
      while (*in) {
         if (!isdigit(*in) && *in!='/' && *in!='|') {
            /* oops, this isn't multipart information */
            *out++ = *start;
            in = start + 1;
            break;
         }
         if (*in!=']' && *in!=')')
            ++in;
      }
      continue;
   }

   /* strip out junk that breaks sorting  */
   if (_keep_chars[*in])
   *out++ = tolower(*in);

   ++in;
}

Comment 9 Roman Möller 2002-07-24 19:53:54 UTC

Keeping all but digits ? I do not do that.
Normalized form will not be "some text (of).part01.P01"
It will be "some text 01of03part01p01".
( not verified, only reading code. Will verify tomorrow )
Parentheses are not in _keep_chars. The line to build _keep_chars reads

_keep_chars[i] = isalnum(uch) || isdigit(uch) || isspace(uch)

No parenthesis (or dots) included. Btw, the isdigit test is redundant,
Already included in isal_num_.

I'm a little bemazed. Are we talking about the same code basis ?

To your proposel : No need to worry about parenthesis, or first
matched sign ( It's gonna be stripped anyway). You say, it's a
multipart string, if its only contains digits or /|. I say, it's not a
multipart string, if there is a letter. Thats pretty much the same. A
little more general with my proposal, because e.g. (10.20) would be
declared as multipart.

Sincerely
Roman

Comment 10 Charles Kerr 2002-07-24 21:09:07 UTC

> I'm a little bemazed. Are we talking about the same code basis ?

The error is on my end.  I had a handful of things going on here
and didn't give enough time to reading your patch.  I agree with
your rationale & have tested your patch; it seems fine.

Just to make sure we're all on the same page,
here is the patch:
http://cvs.gnome.org/bonsai/cvsview2.cgi?diff_mode=context&whitespace_mode=show&subdir=pan/pan/base&command=DIFF_FRAMESET&file=article-thread.c&rev1=1.30&rev2=1.31&root=/cvs/gnome

and here is the result of the test:
** Message: in: [|-some text (1of2).part01.P01 (1/45)]
** Message: out: [some text 1of2part01p01 ]

Thanks for your patience. ;)