Bug 538224 – Podcast RSS parsing doesn't handle non-UTF encodings

Summary:	Podcast RSS parsing doesn't handle non-UTF encodings


Status:	RESOLVED FIXED

Product:	banshee
Classification:	Other
Component:	Podcasting
Version:	1.3.1
Hardware:	Other All

Importance:	Normal normal
Target Milestone:	1.4.2
Assigned To:	Mike Urbanski
QA Contact:	Mike Urbanski

URL:
Whiteboard:

Duplicates:	539546
Depends on:
Blocks:

Reported:	2008-06-13 22:25 UTC by Lukas Michelbacher
Modified:	2009-01-07 18:45 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Find out the encoding of the XML and use it (1.21 KB, patch) 2008-10-22 22:15 UTC, Benjamín Valero Espinosa	needs-work
Find out the encoding of the XML and use it (1.10 KB, patch) 2008-10-26 23:05 UTC, Benjamín Valero Espinosa	needs-work
Find out the encoding of the XML and use it (1.33 KB, patch) 2008-12-10 22:20 UTC, Benjamín Valero Espinosa	committed

Description Lukas Michelbacher 2008-06-13 22:25:33 UTC

Please describe the problem:
Banshee doesn't display special characters (e.g. ä, ö or ü) correctly. A diamond with a '?' inside (the Unicode replacement character) is shown instead.

Example podcast:
http://www.dradio.de/rss/podcast/sendungen/politischesfeuilleton

Steps to reproduce:
1. Subscribe to a podcast whose feed contains special characters (e.g. http://www.dradio.de/rss/podcast/sendungen/politischesfeuilleton)
2. Look at the episode descriptions


Actual results:


Expected results:
Special characters should be displayed correctly.

Does this happen every time?
Yes.

Other information:

Comment 1 Bertrand Lorentz 2008-07-17 22:13:01 UTC

*** Bug 539546 has been marked as a duplicate of this bug. ***

Comment 2 Bertrand Lorentz 2008-07-17 22:18:30 UTC

It looks like the feed indicated here and the one indicated in the duplicate are both encoded in ISO-8859.

A conversion might be necessary, if it's not done by the XML parser.
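
To illustrate the symptom, here is a minimal sketch in Python (not Banshee's C#), using a made-up sample string. Bytes written as ISO-8859-1 but decoded as UTF-8 turn each accented character into U+FFFD, which is the diamond with a '?' described in the report. Decoding with the charset the feed actually uses recovers the text.

```python
# Hypothetical sample text; any string with accented Latin-1 characters works.
text = "Politisches Feuilleton: ä ö ü"

# The bytes as an ISO-8859-1 encoded feed would deliver them.
raw = text.encode("iso-8859-1")

# Wrong charset: the lone bytes 0xE4, 0xF6 and 0xFC are not valid UTF-8,
# so each one is replaced by U+FFFD.
wrong = raw.decode("utf-8", errors="replace")

# Right charset: the original text comes back unchanged.
right = raw.decode("iso-8859-1")

print(repr(wrong))
print(repr(right))
```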

Comment 3 Benjamín Valero Espinosa 2008-10-22 22:15:28 UTC

Created attachment 121162
Find out the encoding of the XML and use it

This patch takes the XML and its 'encoding' attribute, and parses the RSS using this encoding, and not using the system default (usually UTF-8).

Comment 4 Gabriel Burt 2008-10-22 22:22:24 UTC

Hrm, seems like this patch should not be XML dependent at all.  Can you try to parse the HTTP header for the encoding, and use that?  That way we'll properly get the UTF System.String for any returned string value.

Comment 5 Benjamín Valero Espinosa 2008-10-24 15:02:35 UTC

Gabriel, you are right. I am trying your suggestion, but I have found a problem: an RSS feed whose HTTP response (HttpWebResponse) reports ISO-8859-1 as its CharacterSet, while the downloaded XML declares UTF-8 as the encoding in its XML declaration.

http://www.ikerjimenez.com/podcast.xml

What should we do in this case: complain to the podcast author, or parse the encoding declared inside the XML?
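
The conflict described here can be sketched as follows, in Python rather than Banshee's C#. The function names and the precedence order (XML declaration first, then the HTTP header, then a default) are assumptions made for illustration; they show one possible policy, not the logic of any attached patch.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


def choose_encoding(http_charset, xml_charset, default="utf-8"):
    """Assumed policy: trust the XML declaration, then the header, then the default."""
    return (xml_charset or http_charset or default).lower()


# The situation from the comment: the header says ISO-8859-1, the document says UTF-8.
http_charset = charset_from_content_type("text/xml; charset=ISO-8859-1")
chosen = choose_encoding(http_charset, "UTF-8")
print(http_charset, chosen)
```

Swapping the first two operands in `choose_encoding` gives the opposite policy, where the HTTP header wins.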

Comment 6 Gabriel Burt 2008-10-25 02:53:40 UTC

Hrm, with that being the case (and the opposite also happening, where the HTTP header says the document is UTF-8 but the XML encoding is set to 8859-1), I guess we do have to hack around the problem.

Can we try to optimize your patch, though?  Creating an XmlDocument is relatively expensive.  Could probably instead take a substring of some arbitrary length (say 40) and use IndexOf calls or a regex to pull out the encoding.  A regex would probably be cleanest.  See src/Libraries/Hyena/Hyena/CryptoUtil.cs for a regex example.
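
A minimal sketch of the regex idea, in Python rather than C#. The function name, the pattern and the 100-byte window are assumptions for illustration. The window is wider than 40 because a declaration such as `<?xml version="1.0" encoding="ISO-8859-1"?>` is already 43 characters long, so a 40-character prefix would cut off the closing quote. The sketch works on raw bytes and assumes an ASCII-compatible encoding, so it would not handle UTF-16 feeds.

```python
import re

# Matches encoding="..." or encoding='...' inside an XML declaration.
_ENCODING_RE = re.compile(
    rb"""<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def declared_encoding(raw, prefix_len=100):
    """Return the encoding named in the XML declaration, or None if absent.

    Only the first prefix_len bytes are scanned, so no XML document is built.
    """
    match = _ENCODING_RE.search(raw[:prefix_len])
    return match.group(1).decode("ascii") if match else None


feed = b'<?xml version="1.0" encoding="ISO-8859-1"?><rss version="2.0"/>'
print(declared_encoding(feed))
```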

Comment 7 Benjamín Valero Espinosa 2008-10-26 23:05:13 UTC

Created attachment 121402
Find out the encoding of the XML and use it

Comment 8 Benjamín Valero Espinosa 2008-10-26 23:06:25 UTC

It is the same as before, but just playing with the string methods. I don't think it is worth using regular expressions here.

Comment 9 Andrés G. Aragoneses (IRC: knocte) 2008-12-04 15:01:06 UTC

Just a suggestion. Instead of:

+                    if (s.StartsWith("<?xml")) {

I would do:

if (s.TrimStart ().StartsWith (...


Because you may find a bit of rubbish before the document actually starts (I've seen this sometimes).

Or even better:

s = Encoding.GetString (resultPtr).TrimStart ();
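
The same idea as a small Python sketch (the function name is hypothetical). Like .NET's TrimStart with no arguments, `str.lstrip()` removes leading whitespace only, so this tolerates stray blank lines or spaces before the declaration but not arbitrary leading bytes.

```python
def starts_with_xml_declaration(text):
    """True if the text begins with an XML declaration, ignoring leading whitespace."""
    return text.lstrip().startswith("<?xml")


print(starts_with_xml_declaration("\n  <?xml version='1.0' encoding='utf-8'?><rss/>"))
print(starts_with_xml_declaration("<rss/>"))
```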

Comment 10 Bertrand Lorentz 2008-12-07 14:55:23 UTC

Please also follow the coding style guidelines: put a space before the parentheses of a method call.

Comment 11 Benjamín Valero Espinosa 2008-12-10 22:20:58 UTC

Created attachment 124388
Find out the encoding of the XML and use it

I hope you like the patch better now. I am really new to C#.

Comment 12 Andrés G. Aragoneses (IRC: knocte) 2008-12-11 15:17:58 UTC

(In reply to comment #11)
> Created an attachment (id=124388) [edit]
> Find out the encoding of the XML and use it

+                        } catch (ArgumentException) {}

Why this? An ArgumentException can normally be avoided ahead of time instead of using a catch block. BTW, I would convert the later catch{s=""} block into something that at least sends the exception to the log.

Comment 13 Benjamín Valero Espinosa 2008-12-11 15:59:31 UTC

Well, I was unsure whether to catch that kind of exception or not. I did it following the reference documentation for the method:

http://msdn.microsoft.com/en-us/library/t9a3kf7c.aspx

The exception will be thrown if the given encoding is unknown or wrong, but I want to catch it in order to keep on using the default encoding (and not return an empty string).

I agree about handling the later catch{s=""} block, but I have been trying not to change much code, given my limited experience.

Thanks for your advice!
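
The fallback behaviour discussed in the last two comments can be sketched like this, in Python rather than C#. Python raises LookupError for an unknown encoding name, which plays the role that ArgumentException plays for Encoding.GetEncoding in .NET. The function name, the logger name and the UTF-8 default are assumptions for illustration; the unknown name is logged instead of being swallowed silently.

```python
import codecs
import logging

log = logging.getLogger("feed")  # hypothetical logger name


def decode_feed(raw, declared, default="utf-8"):
    """Decode raw bytes with the declared encoding, or the default if it is unknown."""
    encoding = default
    if declared:
        try:
            codecs.lookup(declared)  # raises LookupError for unknown names
            encoding = declared
        except LookupError:
            log.warning("Unknown encoding %r, falling back to %s", declared, default)
    # Undecodable bytes become U+FFFD rather than aborting the whole feed.
    return raw.decode(encoding, errors="replace")


print(decode_feed("ä ö ü".encode("iso-8859-1"), "ISO-8859-1"))
print(decode_feed(b"plain ascii", "no-such-charset"))
```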

Comment 14 Gabriel Burt 2009-01-07 18:45:30 UTC

Thanks so much for the patch, Benjamin, I've committed it.