GNOME Bugzilla – Bug 363268
mixed-charset messages get garbled
Last modified: 2006-11-02 17:58:31 UTC
Messages written in UTF-8 come out as rubbish. The same applies when reading messages from others written in UTF-8. I suspect it is the same bug that was in the old PAN, which could not handle messages written in UTF-8 if the news server or the poster added a footer in a different charset.

Example:
ÊÞå ÃÃà <------- UTF-8 (æøå ÆØÅ)
<----- part below written in ISO-8859-1
--
Hilsen/Regards
Michael Rasmussen
http://keyserver.veridis.com:11371/pks/lookup?op=get&search=0xE3E80917
--------------------------------------------------------
Denne postliste er til test af din email i forhold til SSLUGs postlister. Vær sød ikke at misbruge denne
Chris, could you take a look at this ticket?
Michael: can you attach the following evidence:
- an article you've READ where UTF-8 is broken
- an article you've POSTED where UTF-8 is broken
- your sig
Copy of read article:

Return-Path: <sslug-novice-return-38839-mail2news=sslug.dk@sslug.dk>
Delivered-To: mail2news@sslug.dk
Mailing-List: contact sslug-novice-help@sslug.dk; run by ezmlm
Precedence: bulk
X-No-Archive: yes
list-help: <mailto:sslug-novice-help@sslug.dk>
list-unsubscribe: <mailto:sslug-novice-unsubscribe@sslug.dk>
errors-to: sslug-error@sslug.dk
Delivered-To: mailing list sslug-novice@sslug.dk
From: "Michael Schmidt" <michael.zmit@gmail.com>
Date: Thu, 19 Oct 2006 00:10:32 +0200
Organization: SSLUG
Lines: 25
Message-ID: <op.thm07utzdi3geh@news.sslug.dk>
Mime-Version: 1.0
Content-Type: text/plain; format=flowed; delsp=yes; charset=utf-8
Content-Transfer-Encoding: 8bit
NNTP-Posting-Date: Wed, 18 Oct 2006 22:12:09 +0000 (UTC)
User-Agent: Opera Mail/9.02 (Linux)
Subject: Re: [NOVICE] Ubuntu 6.06 LTS
Newsgroups: sslug.novice
References: <eh4j24$38l$1@www.sslug.dk> <eh5hua$agt$1@shrek.krogh.cc> <eh5tao$87d$1@www.sslug.dk> <eh5tme$aaa$1@www.sslug.dk> <eh5ua7$g2p$1@www.sslug.dk> <eh5ugo$979$1@shrek.krogh.cc> <eh62br$bhk$1@www.sslug.dk>
Approved: news@sslug.dk
Path: news.sslug.dk!sslug.dk!not-for-mail
Xref: news.sslug.dk sslug.novice:38341

Wed, 18 Oct 2006 22:21:15 +0200, JÞrgen Heesche <heesche@webspeed.dk> skrev:

> Jesper Krogh wrote:
>> I sslug.novice, skrev Claus:
>>> Atte André Jensen wrote:
>>>> Claus wrote:
>>>>> OK, den er sat til at downloade.
>>>>> Men hvordan skal den brÊndes for at det bliver gjort rigtigt?
>>>> hvad med "cdrecord ubuntu-6.06.1-alternate-i386.iso"?
>>> Hmmm, den fylder 713550 KB.
>>> Kan det vÊre pÃ¥ en CD?
>> Det er den designet til.. sÃ¥ det vil jeg tro.
>
> Jeg har altid forstÃ¥et at maximum er 650 MB pÃ¥ en CD.
>
Det var det ogsÃ¥ tidligere. Idag er 700MB/80min nÊrmest blevet standard, men der findes ogsÃ¥ 800MB/90min og sÃ¥gar ogsÃ¥ 900MB/100min, men de to sidstnÊvnte krÊver at drev og brÊndersoftware kan hÃ¥ndtere dem.

--
Med venlig hilsen /Zmit/ RLU # 314205
sslug-novice: Listen for begynder-relaterede spørgsmål

Copy of posted article:

Return-Path: <sslug-test-return-5727-mail2news=sslug.dk@sslug.dk>
Delivered-To: mail2news@sslug.dk
Mailing-List: contact sslug-test-help@sslug.dk; run by ezmlm
Precedence: bulk
X-No-Archive: yes
errors-to: sslug-error@sslug.dk
Delivered-To: mailing list sslug-test@sslug.dk
From: "Michael Rasmussen" <mir@miras.org>
Date: Thu, 19 Oct 2006 21:46:37 +0000 (UTC)
Organization: SSLUG
Lines: 6
Message-ID: <eh8rnt$52u$1@www.sslug.dk>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
NNTP-Posting-Date: Thu, 19 Oct 2006 21:46:37 +0000 (UTC)
User-Agent: pan 0.117 (We'll fly and we'll fall and we'll burn)
Subject: [TEST] Test with UTF-8
Newsgroups: sslug.test
Approved: news@sslug.dk
Path: news.sslug.dk!sslug.dk!not-for-mail
Xref: news.sslug.dk sslug.test:5193

ÊÞå ÃÃà
--
Hilsen/Regards
Michael Rasmussen
http://keyserver.veridis.com:11371/pks/lookup?op=get&search=0xE3E80917
--------------------------------------------------------
Denne postliste er til test af din email i forhold til SSLUGs postlister. Vær sød ikke at misbruge denne

My signature:
--
Hilsen/Regards
Michael Rasmussen
http://keyserver.veridis.com:11371/pks/lookup?op=get&search=0xE3E80917

Signature added by list:
--------------------------------------------------------
Denne postliste er til test af din email i forhold til SSLUGs postlister. Vær sød ikke at misbruge denne
BTW, the old bug I am referring to is this one: http://bugzilla.gnome.org/show_bug.cgi?id=317156
Oh, I remember that bug. Thanks for the reference: that saved me some time.
Charles: this is what's happening: pan::content_to_utf8() tries to pass the article through g_convert for the different charsets (article's charset, group's charset and hardcoded CURRENT and ISO-8859-15). The article contains both UTF-8 and ISO-8859-1 characters. If you use g_convert to convert from ISO-8859-1 to UTF-8, g_convert will treat each byte of the 2-byte UTF-8 characters as a separate ISO-8859-1 character and succeed! The UTF-8 characters will of course be garbled. You can't simply remove ISO-8859-1 from the hardcoded fallback charsets, since the group's charset may still be ISO-8859-1 and you'll have the same problem.
The old Pan did not suffer from this problem, since it did not use g_convert() for this purpose. It used an internal function (g_mime_charset_strndup), which used gmime streams to convert. That approach did not suffer from that problem.
Reassigning to you, since I have no idea how to fix that without major surgery. :)
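The failure mode described above can be reproduced outside Pan. The following is a minimal sketch in Python, not Pan's code: Python's built-in codecs stand in for g_convert, and `first_successful_decode` is a hypothetical helper that models the "try each charset in turn" behaviour attributed to pan::content_to_utf8(). The sample message (UTF-8 body plus ISO-8859-1 footer) and the fallback order are assumptions made for illustration.

```python
def first_successful_decode(raw, charsets):
    """Return (charset, text) for the first charset that decodes raw without error."""
    for charset in charsets:
        try:
            return charset, raw.decode(charset)
        except UnicodeDecodeError:
            continue  # this charset rejected the bytes; try the next one
    return None, None


# A mixed-charset article: UTF-8 body followed by an ISO-8859-1 footer.
body = "æøå ÆØÅ".encode("utf-8")
footer = "Vær sød".encode("iso-8859-1")
raw = body + b"\n" + footer

# UTF-8 is rejected because the footer bytes are not valid UTF-8.
# ISO-8859-1 accepts any byte sequence, so it "succeeds" and garbles the body.
charset, text = first_successful_decode(raw, ["utf-8", "iso-8859-1", "iso-8859-15"])
print(charset)
print(ascii(text))
```

Because every byte value maps to some ISO-8859-1 character, a conversion from ISO-8859-1 cannot fail, which is why a successful conversion says nothing about whether the charset guess was right.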
Chris: so if I reimplement content_to_utf8() to use the old g_mime_charset_strndup() code, that would fix this?
assuming the underlying gmime still behaves the same way, I'd guess so.
Created attachment 75733
cvs head patch

Replacing g_convert with g_mime_charset_strndup is a simple one-liner. Does this fix it?
I tested it, and it indeed no longer garbles the message (i.e. the part of the message that's in the content-type's charset). With this patch, the invalid characters are simply removed, whereas the old Pan would display '?' for each non-utf8 character. It looks like this is a difference in the underlying gmime behaviour: the non-utf8 characters are already removed by the time g_mime_charset_strndup() returns.
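The two behaviours compared above (dropping invalid bytes versus showing '?' for each one) can be illustrated with a small Python sketch. This is only an analogy using Python's decoder error handlers, not the gmime code path; the sample bytes are an assumption modelled on the list footer in this report.

```python
# UTF-8 text followed by ISO-8859-1 bytes (0xE6 = ae ligature, 0xF8 = o-slash),
# which are not valid UTF-8 in this position.
raw = "æøå".encode("utf-8") + b" V\xe6r s\xf8d"

# Behaviour seen with the patch: invalid bytes silently disappear.
dropped = raw.decode("utf-8", errors="ignore")

# Behaviour described for the old Pan: one '?' per invalid byte.
# Python inserts U+FFFD for each invalid byte; map it to '?' for display.
marked = raw.decode("utf-8", errors="replace").replace("\ufffd", "?")

print(ascii(dropped))
print(ascii(marked))
```

In both cases the valid UTF-8 part survives intact; the difference is only whether the reader gets a visible hint that something was lost.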
So although we're still losing those invalid characters, old-pan did that too, and we're better off than before because the message is no longer garbled. Is that right? If so, should this change be checked in and the bug marked closed?
Yes, works for me.