GNOME Bugzilla – Bug 538224
Podcast RSS parsing doesn't handle non-UTF encodings
Last modified: 2009-01-07 18:45:30 UTC
Please describe the problem:
Banshee doesn't display special characters (e.g. ä, ö or ü) correctly. A diamond with a '?' inside appears instead.
Example podcast: http://www.dradio.de/rss/podcast/sendungen/politischesfeuilleton

Steps to reproduce:
1. Subscribe to a podcast (e.g. http://www.dradio.de/rss/podcast/sendungen/politischesfeuilleton)
2. Look at the episode descriptions

Actual results:

Expected results:
Special characters should be displayed correctly.

Does this happen every time?
Yes.

Other information:
*** Bug 539546 has been marked as a duplicate of this bug. ***
It looks like the feed indicated here and the feed indicated in the duplicate are encoded in iso-8859. A conversion might be necessary, if it's not done by the XML parser.
Created attachment 121162
Find out the encoding of the XML and use it

This patch reads the 'encoding' attribute from the XML and parses the RSS using that encoding instead of the system default (usually UTF-8).
Hrm, seems like this patch should not be XML dependent at all. Can you try to parse the HTTP header for the encoding, and use that? That way we'll properly get the UTF System.String for any returned string value.
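The header-based approach suggested in the comment above can be sketched as follows. This is a Python illustration, not Banshee's C# code; the function name `charset_from_content_type` is hypothetical, and the sketch assumes the caller already has the raw Content-Type header value as a string.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type header, or None.

    Hypothetical helper for illustration only.
    """
    if not header_value:
        return None
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the charset lowercased,
    # or None when the header carries no charset parameter.
    return msg.get_content_charset()
```

As the following comments show, the header value is only one hint: a feed can declare a different encoding inside the XML itself.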
Gabriel, you are right. I am trying your suggestion, but I have now found a problem: an RSS feed whose HTTP response (HttpWebResponse) reports ISO-8859-1 as its CharacterSet, while the downloaded XML declares UTF-8 as the encoding in its XML declaration. http://www.ikerjimenez.com/podcast.xml What should we do in this case: complain to the podcast author, or parse the encoding inside the XML?
Hrm, with that being the case (and the opposite happening, where the HTTP header says the document is UTF-8 but the XML encoding is set to 8859-1) I guess we do have to hack around the problem. Can we try to optimize your patch, though? Creating an XmlDocument is relatively expensive. We could probably instead take a substring of some arbitrary length (say 40) and use IndexOf calls or a regex to pull out the encoding. A regex would probably be cleanest. See src/Libraries/Hyena/Hyena/CryptoUtil.cs for a regex example.
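A minimal sketch of the regex idea from the comment above, written in Python for illustration rather than in Banshee's C#. The function name `sniff_xml_encoding`, the pattern, and the 200-byte window are assumptions made for this example. The window is wider than the 40 characters mentioned above because a declaration such as `<?xml version="1.0" encoding="ISO-8859-1"?>` is already 43 characters long.

```python
import re

# Matches the encoding pseudo-attribute of an XML declaration, e.g.
# <?xml version="1.0" encoding="ISO-8859-1"?>
_XML_DECL_ENCODING = re.compile(
    rb"""^<\?xml[^>]*?\bencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def sniff_xml_encoding(raw, window=200):
    """Return the encoding named in the XML declaration, or None.

    Only the first `window` bytes are inspected, so no XML document
    object has to be built. Hypothetical helper for illustration only.
    """
    # Drop leading ASCII whitespace before looking for "<?xml".
    head = raw[:window].lstrip()
    match = _XML_DECL_ENCODING.match(head)
    return match.group(1).decode("ascii") if match else None
```

The sketch works on raw bytes because the declaration itself is plain ASCII in the encodings discussed here (UTF-8 and ISO-8859-1); it does not handle a byte order mark or UTF-16 input.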
Created attachment 121402
Find out the encoding of the XML and use it

It is the same as before, but using string methods instead of an XmlDocument. I don't think it is worth using regular expressions here.
Just a suggestion. Instead of:

+ if (s.StartsWith("<?xml")) {

I would do:

if (s.TrimStart ().StartsWith (...

because you may find a bit of rubbish before the document actually starts (I've seen this sometimes). Or even better:

s = Encoding.GetString (resultPtr).TrimStart ();
Please also follow the coding style guidelines: put a space before method parentheses.
Created attachment 124388
Find out the encoding of the XML and use it

I hope you like the patch more now. I am really new to C#.
(In reply to comment #11)
> Created an attachment (id=124388)
> Find out the encoding of the XML and use it

+ } catch (ArgumentException) {}

Why this? An ArgumentException can normally be avoided ahead of time instead of using a catch block. BTW, I would convert the later catch{s=""} block into something that at least sends the exception to the log.
Well, I was unsure whether to catch that kind of exception. I did it following the reference documentation for the method: http://msdn.microsoft.com/en-us/library/t9a3kf7c.aspx The exception is thrown if the given encoding is unknown or invalid, and I want to catch it so that we keep using the default encoding (rather than returning an empty string). I agree that the later catch{s=""} should be handled better, but I have been trying not to change a lot of code, given my limited experience. Thanks for your comments and advice!
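The fallback behaviour described above (keep the default encoding when the declared name is unknown, instead of returning an empty string) can be illustrated with a short Python sketch. It is not the committed C# patch; the name `decode_feed` and the UTF-8 default are assumptions. Python's `LookupError` plays the role that the unknown-encoding exception plays in the comment above.

```python
def decode_feed(raw, declared, default="utf-8"):
    """Decode feed bytes with the declared encoding when it is usable.

    Falls back to `default` if no encoding was declared or the name is
    not recognised. Hypothetical helper for illustration only.
    """
    if declared:
        try:
            return raw.decode(declared, errors="replace")
        except LookupError:
            # Unknown encoding name: fall through to the default
            # rather than discarding the whole document.
            pass
    return raw.decode(default, errors="replace")
```

With `errors="replace"`, undecodable bytes become U+FFFD replacement characters, so a wrong guess degrades individual characters but still yields a document.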
Thanks so much for the patch, Benjamin, I've committed it.