GNOME Bugzilla – Bug 395050
autodetection of the CSS file's encoding fails
Last modified: 2020-08-11 15:46:34 UTC
From http://mail.gnome.org/archives/libcroco-list/2006-November/msg00001.html: I was trying to use cr_om_parser_simply_parse_file ((const guchar *) css_filename, CR_AUTO, &css_file_contents), hoping for autodetection of the CSS file's encoding, but it always returns an error code. The reason is the call chain: cr_om_parser_simply_parse_file calls cr_parser_parse_file, which calls cr_tknzr_new_from_uri, which calls cr_input_new_from_uri, which calls cr_input_new_from_buf, which calls cr_enc_handler_get_instance, which doesn't know about the encoding!
I think that we shouldn't try to implement charset detection inside libcroco. Handling the @charset rule is OK, since it's not really detection, but that's all. Even Mozilla sometimes has difficulty telling a latin1 page from a UTF-8 one (or maybe it was the content-encoding header from the web server that was wrong, I don't know). I'm against the CR_AUTO flag and advocate some CR_DEFAULT, formally equivalent to CR_UTF_8.
I'm not sure that libcroco should have to deal with encodings *at all*. Libcroco should assume that its input comes in some well-defined encoding (most likely UTF-8), and put the burden of determining the file's encoding on a higher level, which might have some knowledge of the document's encoding. For instance, it is possible that the CSS snippet is inside a UTF-8 encoded XML document, or that an HTTP header says that it is iso-8859-1. The responsibility should be on the invoking application to convert the CSS into libcroco's expected encoding.
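To make the "caller converts first" idea concrete, here is a minimal sketch of what the invoking application might do when it already knows (say, from an HTTP header) that the stylesheet is iso-8859-1. The helper name latin1_to_utf8 is hypothetical and not part of libcroco; a real application would more likely use iconv or a GLib conversion routine, and this only covers the Latin-1 case.

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not a libcroco function): convert an ISO-8859-1
 * buffer into a newly allocated, NUL-terminated UTF-8 string.
 * Returns NULL on allocation failure; the caller frees the result.
 */
unsigned char *latin1_to_utf8(const unsigned char *src, size_t len,
                              size_t *out_len)
{
    /* Each Latin-1 byte expands to at most two UTF-8 bytes. */
    unsigned char *dst = malloc(len * 2 + 1);
    size_t i, j = 0;

    if (dst == NULL)
        return NULL;

    for (i = 0; i < len; i++) {
        unsigned char c = src[i];
        if (c < 0x80) {
            /* ASCII range is identical in both encodings. */
            dst[j++] = c;
        } else {
            /* U+0080..U+00FF become a two-byte sequence 110xxxxx 10xxxxxx. */
            dst[j++] = (unsigned char)(0xC0 | (c >> 6));
            dst[j++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    dst[j] = '\0';
    if (out_len != NULL)
        *out_len = j;
    return dst;
}
```

The converted buffer would then be handed to the parser with CR_UTF_8, so libcroco itself never has to guess.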
And what do you do about the @charset rule inside the stylesheet? Are you asking the UA to parse a little bit of CSS to handle this?
That's a good point. However, if we were to follow my suggestion, we might be able to get away with pushing the responsibility to the user agent. From http://www.w3.org/International/questions/qa-css-charset: "Only one @charset rule may appear in an external style sheet and it must appear at the very start of the document. It must not be preceded by any characters, not even comments [other than byte-order markers]." If I understand the specification correctly, the CSS 2.1 spec seems to delegate almost *all* of the responsibility of determining the character encoding of a CSS snippet to the user agent. Of their list of 5 priorities, libcroco simply cannot know about #1, #3, or #4. #5 is a fall-back if nothing else is known. So that just leaves #2 as nebulous. Furthermore, "User agents must ignore style sheets in unknown encodings." (http://www.w3.org/TR/CSS21/syndata.html#q23)
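Because the rule has to sit at the very start of the file, a user agent would not need a real CSS parser to read it. As a rough sketch (the function name sniff_css_charset is made up for illustration, and it assumes an ASCII-compatible encoding and handles only a UTF-8 byte-order mark), the check could be as small as this:

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not a libcroco function): if the buffer starts with
 * a rule of the form  @charset "NAME";  (optionally preceded by a UTF-8
 * byte-order mark), copy NAME into `name` and return 1. Otherwise return 0.
 * Assumes the stylesheet is in an ASCII-compatible encoding.
 */
int sniff_css_charset(const unsigned char *buf, size_t len,
                      char *name, size_t name_size)
{
    static const char prefix[] = "@charset \"";
    const size_t prefix_len = sizeof(prefix) - 1;
    size_t i, n = 0;

    /* Skip a UTF-8 byte-order mark, the only thing allowed before the rule. */
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) {
        buf += 3;
        len -= 3;
    }

    /* The rule must be the very first thing: no whitespace, no comments. */
    if (len < prefix_len || memcmp(buf, prefix, prefix_len) != 0)
        return 0;

    /* Copy the encoding name up to the closing quote. */
    for (i = prefix_len; i < len && buf[i] != '"'; i++) {
        if (n + 1 >= name_size)
            return 0;               /* name does not fit in the output buffer */
        name[n++] = (char)buf[i];
    }

    /* Require a non-empty name followed by the closing quote and ';'. */
    if (n == 0 || i + 1 >= len || buf[i] != '"' || buf[i + 1] != ';')
        return 0;

    name[n] = '\0';
    return 1;
}
```

A caller would use the returned name only as priority #2, after any encoding given by the transport, and then convert the buffer before passing it on.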
libcroco is not under development anymore. Its codebase has been archived. Closing this report as WONTFIX as part of Bugzilla Housekeeping to reflect reality. Please feel free to reopen this ticket (or rather transfer the project to GNOME Gitlab, as GNOME Bugzilla is being shut down) if anyone takes the responsibility for active development again.