GNOME Bugzilla – Bug 552357
Renderer engine not support any encoding except UTF8
Last modified: 2009-01-23 12:13:41 UTC
Rendering engine not suuport convert page body to utf and backwards convert qury params to encoding source page. I think it can be resolved with patch http://code.google.com/p/skybrowser/source/browse/trunk/patches/gtkhtml-r8991-fullencoding.patch. Please review this patch and tell me your opinion. This patch add: support http-equiv and set encoding if it exist in http - headers re-coding resulted query from form If I find error in my patch - I save resolution in http://code.google.com/p/skybrowser/source/browse/trunk/patches/ Thank you.
Created attachment 119220 [details] [review] patch to support
Thanks for you effort. I found some issues with your patch: 0) please attach any changes here, it's better trackable (you did already, which is fine, thanks). a) your coding style is quite different from the one usually used in the code. I know it's not same in older sources, but please try to write in similar coding style as is written new code, thanks. Please read this for further information http://www.gnome.org/projects/evolution/patch.shtml b) do not use comments like // , use /* */ c) do not declare variables in the middle of the block, declare them at the beginning. d) please add ChangeLog entry too, in same folder as your changes are, thanks. e) you name function 'add_unichar' but its parameter is gchar, furthermore, in that function you do 'c++;' which doesn't make any sense to me. f) there are compiler warnings after your patch: htmlembedded.c: In function ‘html_embedded_encode_string’: htmlembedded.c:235: warning: initialization discards qualifiers from pointer target type htmltokenizer.c: In function ‘ConvertIconv’: htmltokenizer.c:468: warning: initialization discards qualifiers from pointer target type htmltokenizer.c: In function ‘html_tokenizer_real_get_content_type’: htmltokenizer.c:514: warning: return discards qualifiers from pointer target type htmltokenizer.c: In function ‘genIconvFrom’: htmltokenizer.c:632: warning: passing argument 1 of ‘charset_is_utf8’ discards qualifiers from pointer target type htmltokenizer.c: In function ‘genIconvTo’: htmltokenizer.c:645: warning: passing argument 1 of ‘charset_is_utf8’ discards qualifiers from pointer target type testgtkhtml.c: In function ‘got_data’: testgtkhtml.c:672: warning: initialization discards qualifiers from pointer target type I didn't check functionality of the code yet, because there is some possibility that the changes could make difference.
Created attachment 119851 [details] [review] Changed previous patch.
+ /* I think gchar is typedef from unsigned char */ + iconv (iconv_cd,(char **)¤t, &currlength, ... it's because you declared that a "const gchar *", but the function expects non-const. struct _HTMLTokenizerPrivate::content_type should be declared as nonconst, because you own that pointer and you are responsible for that. You do not free that pointer too, I didn't find such thing in html_tokenizer_finalize. You should fre that pointer in html_tokenizer_real_change as well. I hope I'm not wrong, but I think the content_type is case insensitive, which means you should store the content type in lowercase, to have your functions running properly. After call of html_engine_get_content_type you leak the returned pointer. Why does that return new pointer? Isn't it easier to return the object's pointer as const gchar *? in 'is_text', rather looking for "=text/" instead of "text/"? Also, return gboolean instead of gint, please. getEncoding should return const gchar * html_tokenizer_real_change should have "const gchar *content_type", it doesn't modify that parameter. You left there second half of e) from comment #2, the "c++;" statement. You use the iconv library, I think, isn't it safer to use than from GLib? when I look on documentation of g_iconv, then they say it has some fallback where the implementation isn't available in the system. Basic rules for strings: - if you do not modify the content of the parameter, declare it as const. - if you return pointer the object/function owns, let the return pointer be const too. - if there is not necessary to reallocate the pointer (the object will not change/free it meanwhile), then use the inner pointer.
Created attachment 119958 [details] [review] Rework patch Thank you. I remove content_type from gtkhtml, because it can be received from htmlengine. I not change is_text - because this function receive content_type without '=' when content_type be got from http-headers.
I'm sorry for a late response, I was away for some time. I tried to compile and see what it does. The good news is that the gtkhtml compiles smoothly. The bad news is that Evolution, which depends on the GtkHTML at the moment, doesn't compile. You changed API, you removed gtk_html_set_default_content_type, and you changed void (*begin) (HTMLTokenizer *, gchar *content_type); in htmltokenizer.h. I agree with the later change, but it's the API change, which we should not do, it would be great to stay backward compatible, even for price of such bad design they did when creating the tokenizer. Is it bad idea to distinguish between default content type and actual content type which is used during the parsing? When I changed the above things a bit (just to be able to compile Evolution), then it seems as working fine, on the first look at least. Good work Denis.
Created attachment 122236 [details] [review] alternative patch Thank you. I rename gtk_html_set_default_content_type to gtk_html_set_content_type only for unification with gtk_html_get_content_type. And I have a think its function really not used because codepage ignored in code. Now I think rename been mistake. And rename back. About change prototype 'begin' and as result 'html_tokenizer_begin' I think this interface can be used only internally and I think its not critically. I am right?
Created attachment 122333 [details] test eml in iso-8859-2 It's not critical, you've right, but I like to be without warnings, thus one will not overlook some important messages from the compile time. With your last patch I see this on the console when compiling evolution: e-searching-tokenizer.c: In function ‘e_searching_tokenizer_class_init’: e-searching-tokenizer.c:1008: warning: assignment from incompatible pointer type Thus the tokenizer itself isn't the private thing of the GtkHTML. And it was the reason why I was talking about bad design they did, which we are supposed to follow now, even we know the change to the 'const' is the right thing. Or change all the other software which uses it's own tokenizers to be correct (maybe only Evolution uses it, but maybe not). The patch doesn't work for the mail in the attachment.
The interesting thing is that it works fine without the patch, maybe only some little changes are required, or maybe no changes at all? Because you know, the mail is in the iso-8859-2 and not in UTF8, and is shown just fine without the patch.
Created attachment 122431 [details] screenshot maybe describe error Before my changes system used this code: - /* - chars in range 128 - 159 are control characters in unicode, - but most browsers treat them as windows 1252 - encoded characters and translate them to unicode - it's broken, but we do the same here - */ - if (wc > 127 && wc < 160) - wc = win1252_to_unicode [wc - 128]; I delete this code because all text before rendering(in parse stage(element_parse_meta) or from gtk_html_set_default_content_type determine codepage) recoded to utf8. What you mean on "doesn't work"? Maybe when rendering in evolution parse stage not be run or not determine codepage? Please, tell more information. I save "test eml" to file and run "./testgtkhtml file:/../../../../Desktop/gn552357_test.html" and all looks ok.
Created attachment 122443 [details] in evo itself I'm testing it in evo itself, just compiled gtkhtml with a patch and run evo and opened the mail. This is what it shows. (It yells on console about ESearchTokenizer, but it doesn't have any impact on our issue).
I found reason this behavior. Before send text to gtkhtml in evolution text been recoded to utf8. I think need patch to evolution for disable convert. I work on this. But I have question: I can in this "bug" add patch to more than one product? I think in evolution not needed recoding text before send to gtkhtml. It more universal behavior. I am right?
(In reply to comment #12) > I found reason this behavior. Before send text to gtkhtml in evolution text > been recoded to utf8. I think need patch to evolution for disable convert. I > work on this. But I have question: I can in this "bug" add patch to more than > one product? Yes, Evolution-Data-Server, Evolution, Evolution-Exchange and GtkHTML are quite tight together, no problem to upload more patches to each of them into one bug. > I think in evolution not needed recoding text before send to gtkhtml. It more > universal behavior. I am right? It depends. What is the place we are talking about?
Created attachment 122853 [details] [review] Remove description content_type from html before send to gtkhtml. Patch I tested on evolution 2.22.3.1 and 2.24. It must work with patched and not patched gtkhtml. In patched version evolution before send to gtkhtml in html replaced codepage to utf8.
Created attachment 123555 [details] [review] Remove description before send little change - resolve error if text not null terminated
g_strndup is a function you are looking for, but, I really do not think this is the way how to do that, I'm sorry. Further more, data to stream can be written asynchronously, in chunks, which will break your code logic. What did you mean with > I think in evolution not needed recoding text before send to gtkhtml. I didn't get what place/code we are talking about. Do you know Evolution uses GtkHtml also in tasks/memos preview panels?
I agree. May be need add to gtkhtml new constructor with one new parameter 'type of engine' - with convert encoding and without(default if use old constructor)? I think it resolve all problems with all gtkhtml based application. And In this parameter we can choose type of rendering: old render, with convert, with webkit or gekko? It's only idea to rewrite(improve) gtkhtml as proxy to other rendering engines.
> Do you know Evolution uses GtkHtml also in tasks/memos preview panels? I not sure, I think yes. I found use gtkhtml in mail component, but other components can use some functional in mail. And for 100% reliability we can only enable/disable some functionality in gtkhtml.
Created attachment 123646 [details] [review] Example patch for previous comment Add function(gtk_html_set_default_engine) for disable change charset in parse stage. Default(FALSE): disable
You are too quick :) I was thinking about new GtkHTMLBeginFlags flag to gtk_html_begin_full, forcing conversions based on the charset in the body. (In reply to comment #17) > And In this parameter we can choose type of rendering: old render, with > convert, with webkit or gekko? It's only idea to rewrite(improve) gtkhtml as > proxy to other rendering engines. As I understand the effort about WebKit, the desired result is to drop GtkHtml at all, not many people are maintaining it, even less understand it.
Taking the patch off my radar, as its a example patch.
Created attachment 124347 [details] [review] New flag to change type of change content_type I add flag for enable change content_type in parse document. By default its disiabled. Also this flag can be changed/received in gtkhtml_(set/get)_engine_type. I tested it in evolution - used old engine without change view and change code in evolution(patch for evolution not needed). And in standard test program "testgtkhtml http://google.com". I think all right.
OK, seems to be fine, thanks for your work. I'll commit to trunk. I'll also attach a little patch to evolution, to quiet compiler: e-searching-tokenizer.c: In function ‘e_searching_tokenizer_class_init’: e-searching-tokenizer.c:1008: warning: assignment from incompatible pointer type I'll also fix this compiler warning before committing: testgtkhtml.c: In function ‘goto_url’: testgtkhtml.c:856: warning: passing argument 2 of ‘gtk_html_begin_content’ discards qualifiers from pointer target type
Committed to trunk. Committed revision 9061.
Created attachment 124370 [details] evo patch - just for a reference
evo patch committed to trunk. Committed revision 36868.
I found some issues with your patch while running evolution under valgrind and moving from one mail to the another, please look at it. Thanks in advance. ==5014== Thread 1: ==5014== Invalid read of size 1 ==5014== at 0x37A9E80A1B: strcspn (in /lib64/libc-2.9.so) ==5014== by 0x765948D: html_tokenizer_convert_entity (htmltokenizer.c:494) ==5014== by 0x76596D4: html_tokenizer_converted_token (htmltokenizer.c:550) ==5014== by 0x7659983: html_tokenizer_real_next_token (htmltokenizer.c:611) ==5014== by 0x111833C2: e_searching_tokenizer_next_token (e-searching-tokenizer.c:1136) ==5014== by 0x765C139: html_tokenizer_next_token (htmltokenizer.c:1572) ==5014== by 0x7617064: new_parse_body (htmlengine.c:1385) ==5014== by 0x7622213: html_engine_timer_event (htmlengine.c:4908) ==5014== by 0x7628C0C: html_engine_flush (htmlengine.c:6866) ==5014== by 0x75EC127: gtk_html_flush (gtkhtml.c:6134) ==5014== by 0x111C029E: emhs_sync_flush (em-html-stream.c:130) ==5014== by 0x111D3ADF: emss_process_message (em-sync-stream.c:83) ==5014== Address 0xf8dfb7a is 0 bytes after a block of size 98 alloc'd ==5014== at 0x4A0739E: malloc (vg_replace_malloc.c:207) ==5014== by 0x81D01EE: g_malloc (gmem.c:131) ==5014== by 0x7659663: convert_text_encoding (htmltokenizer.c:538) ==5014== by 0x76596CC: html_tokenizer_converted_token (htmltokenizer.c:550) ==5014== by 0x7659983: html_tokenizer_real_next_token (htmltokenizer.c:611) ==5014== by 0x111833C2: e_searching_tokenizer_next_token (e-searching-tokenizer.c:1136) ==5014== by 0x765C139: html_tokenizer_next_token (htmltokenizer.c:1572) ==5014== by 0x7617064: new_parse_body (htmlengine.c:1385) ==5014== by 0x7622213: html_engine_timer_event (htmlengine.c:4908) ==5014== by 0x7628C0C: html_engine_flush (htmlengine.c:6866) ==5014== by 0x75EC127: gtk_html_flush (gtkhtml.c:6134) ==5014== by 0x111C029E: emhs_sync_flush (em-html-stream.c:130) ==5014== ==5014== Invalid read of size 1 ==5014== at 0x37A9E80A04: strcspn (in /lib64/libc-2.9.so) ==5014== by 0x765948D: html_tokenizer_convert_entity (htmltokenizer.c:494) ==5014== by 0x76596D4: html_tokenizer_converted_token (htmltokenizer.c:550) ==5014== by 0x7659983: html_tokenizer_real_next_token (htmltokenizer.c:611) ==5014== by 0x111833C2: e_searching_tokenizer_next_token (e-searching-tokenizer.c:1136) ==5014== by 0x765C139: html_tokenizer_next_token (htmltokenizer.c:1572) ==5014== by 0x7617064: new_parse_body (htmlengine.c:1385) ==5014== by 0x7622213: html_engine_timer_event (htmlengine.c:4908) ==5014== by 0x7628C0C: html_engine_flush (htmlengine.c:6866) ==5014== by 0x75EC127: gtk_html_flush (gtkhtml.c:6134) ==5014== by 0x111C029E: emhs_sync_flush (em-html-stream.c:130) ==5014== by 0x111D3ADF: emss_process_message (em-sync-stream.c:83) ==5014== Address 0xf8dfb7b is 1 bytes after a block of size 98 alloc'd ==5014== at 0x4A0739E: malloc (vg_replace_malloc.c:207) ==5014== by 0x81D01EE: g_malloc (gmem.c:131) ==5014== by 0x7659663: convert_text_encoding (htmltokenizer.c:538) ==5014== by 0x76596CC: html_tokenizer_converted_token (htmltokenizer.c:550) ==5014== by 0x7659983: html_tokenizer_real_next_token (htmltokenizer.c:611) ==5014== by 0x111833C2: e_searching_tokenizer_next_token (e-searching-tokenizer.c:1136) ==5014== by 0x765C139: html_tokenizer_next_token (htmltokenizer.c:1572) ==5014== by 0x7617064: new_parse_body (htmlengine.c:1385) ==5014== by 0x7622213: html_engine_timer_event (htmlengine.c:4908) ==5014== by 0x7628C0C: html_engine_flush (htmlengine.c:6866) ==5014== by 0x75EC127: gtk_html_flush (gtkhtml.c:6134) ==5014== by 0x111C029E: emhs_sync_flush (em-html-stream.c:130) ==5014== ==5014== Invalid read of size 1 ==5014== at 0x37A9E80A0B: strcspn (in /lib64/libc-2.9.so) ==5014== by 0x765948D: html_tokenizer_convert_entity (htmltokenizer.c:494) ==5014== by 0x76596D4: html_tokenizer_converted_token (htmltokenizer.c:550) ==5014== by 0x7659983: html_tokenizer_real_next_token (htmltokenizer.c:611) ==5014== by 0x111833C2: e_searching_tokenizer_next_token (e-searching-tokenizer.c:1136) ==5014== by 0x765C139: html_tokenizer_next_token (htmltokenizer.c:1572) ==5014== by 0x7617064: new_parse_body (htmlengine.c:1385) ==5014== by 0x7622213: html_engine_timer_event (htmlengine.c:4908) ==5014== by 0x7628C0C: html_engine_flush (htmlengine.c:6866) ==5014== by 0x75EC127: gtk_html_flush (gtkhtml.c:6134) ==5014== by 0x111C029E: emhs_sync_flush (em-html-stream.c:130) ==5014== by 0x111D3ADF: emss_process_message (em-sync-stream.c:83) ==5014== Address 0xf8dfb7c is 2 bytes after a block of size 98 alloc'd ==5014== at 0x4A0739E: malloc (vg_replace_malloc.c:207) ==5014== by 0x81D01EE: g_malloc (gmem.c:131) ==5014== by 0x7659663: convert_text_encoding (htmltokenizer.c:538) ==5014== by 0x76596CC: html_tokenizer_converted_token (htmltokenizer.c:550) ==5014== by 0x7659983: html_tokenizer_real_next_token (htmltokenizer.c:611) ==5014== by 0x111833C2: e_searching_tokenizer_next_token (e-searching-tokenizer.c:1136) ==5014== by 0x765C139: html_tokenizer_next_token (htmltokenizer.c:1572) ==5014== by 0x7617064: new_parse_body (htmlengine.c:1385) ==5014== by 0x7622213: html_engine_timer_event (htmlengine.c:4908) ==5014== by 0x7628C0C: html_engine_flush (htmlengine.c:6866) ==5014== by 0x75EC127: gtk_html_flush (gtkhtml.c:6134) ==5014== by 0x111C029E: emhs_sync_flush (em-html-stream.c:130) ==5014== ==5014== Invalid read of size 1 ==5014== at 0x37A9E80A13: strcspn (in /lib64/libc-2.9.so) ==5014== by 0x765948D: html_tokenizer_convert_entity (htmltokenizer.c:494) ==5014== by 0x76596D4: html_tokenizer_converted_token (htmltokenizer.c:550) ==5014== by 0x7659983: html_tokenizer_real_next_token (htmltokenizer.c:611) ==5014== by 0x111833C2: e_searching_tokenizer_next_token (e-searching-tokenizer.c:1136) ==5014== by 0x765C139: html_tokenizer_next_token (htmltokenizer.c:1572) ==5014== by 0x7617064: new_parse_body (htmlengine.c:1385) ==5014== by 0x7622213: html_engine_timer_event (htmlengine.c:4908) ==5014== by 0x7628C0C: html_engine_flush (htmlengine.c:6866) ==5014== by 0x75EC127: gtk_html_flush (gtkhtml.c:6134) ==5014== by 0x111C029E: emhs_sync_flush (em-html-stream.c:130) ==5014== by 0x111D3ADF: emss_process_message (em-sync-stream.c:83) ==5014== Address 0xf8dfb7d is 3 bytes after a block of size 98 alloc'd ==5014== at 0x4A0739E: malloc (vg_replace_malloc.c:207) ==5014== by 0x81D01EE: g_malloc (gmem.c:131) ==5014== by 0x7659663: convert_text_encoding (htmltokenizer.c:538) ==5014== by 0x76596CC: html_tokenizer_converted_token (htmltokenizer.c:550) ==5014== by 0x7659983: html_tokenizer_real_next_token (htmltokenizer.c:611) ==5014== by 0x111833C2: e_searching_tokenizer_next_token (e-searching-tokenizer.c:1136) ==5014== by 0x765C139: html_tokenizer_next_token (htmltokenizer.c:1572) ==5014== by 0x7617064: new_parse_body (htmlengine.c:1385) ==5014== by 0x7622213: html_engine_timer_event (htmlengine.c:4908) ==5014== by 0x7628C0C: html_engine_flush (htmlengine.c:6866) ==5014== by 0x75EC127: gtk_html_flush (gtkhtml.c:6134) ==5014== by 0x111C029E: emhs_sync_flush (em-html-stream.c:130) ==5014== ==5014== Source and destination overlap in memcpy(0x161D1544, 0x161D1548, 103) ==5014== at 0x4A07FCA: memcpy (mc_replace_strmem.c:402) ==5014== by 0x7659464: html_tokenizer_convert_entity (htmltokenizer.c:490) ==5014== by 0x76596D4: html_tokenizer_converted_token (htmltokenizer.c:550) ==5014== by 0x7659983: html_tokenizer_real_next_token (htmltokenizer.c:611) ==5014== by 0x111833C2: e_searching_tokenizer_next_token (e-searching-tokenizer.c:1136) ==5014== by 0x765C139: html_tokenizer_next_token (htmltokenizer.c:1572) ==5014== by 0x7617064: new_parse_body (htmlengine.c:1385) ==5014== by 0x7622213: html_engine_timer_event (htmlengine.c:4908) ==5014== by 0x7628C0C: html_engine_flush (htmlengine.c:6866) ==5014== by 0x75EC127: gtk_html_flush (gtkhtml.c:6134) ==5014== by 0x111C029E: emhs_sync_flush (em-html-stream.c:130) ==5014== by 0x111D3ADF: emss_process_message (em-sync-stream.c:83)
Guys, the htmltokenizer.h changes are an ABI break, especially since Evolution subclasses it. We'll need to bump the soname before tomorrow's release.
I think we don't update so version. Only Evolution affected and we would go with minimum version reqd flag for Evolution and go with same SO version.
That'll work. Minimum version already bumped in Evolution.
Created attachment 124715 [details] [review] Use gperf for generate htmlentity Change file htmlentity.c and resolve issues with memory in html_tokenizer_convert_entity. Please review this function(html_tokenizer_convert_entity). File htmlentity.c generated by gperf - i think it's needed. == htmlentity.c: /* FIXME FIXME this function just sucks. We should use gperf or something instead. */ == I not add(to Makefile) generate this file in each compilation. I think this not needed. After change htmlentity.gperf need only rerun gperf to generate this file.
Created attachment 124719 [details] [review] gperf + memory issue Only change mistake when 2 chars entity not rendered correctly(<)
Thanks for the update, Denis. Valgrind seems to be happy with this patch. One little warning I see when compiling: htmlentity.gperf:114: warning: no previous prototype for ‘html_entity_hash’ I do not know gperf at all, nor about portability and requirements on the build. Thus asking Matt and/or Srag on their thoughts.
(In reply to comment #33) > One little warning I see when compiling: > htmlentity.gperf:114: warning: no previous prototype for ‘html_entity_hash’ Its because 'html_entity_hash' full automatic generated by 'gperf' and in gperf not have parameter "gen with prototype" and not generate prototype by default. For delete this warning need add to htmlentity.c:139(before implementation): ---------- #ifdef __GNUC__ __inline #ifdef __GNUC_STDC_INLINE__ __attribute__ ((__gnu_inline__)) #endif #endif struct _EntityEntry * html_entity_hash (register const char *str, register unsigned int len); ---------- I think is not solution because will be deleted after regenerate with gperf.
May be someone have other solution for this warning?
Committed to trunk. Committed revision 9074. I modified htmltokenizer.h to declare the offending function as 'static', which is enough to make compiler happy. Nonetheless, next file regeneration will lead to the warning again. But I do not care at the moment. I added some changelog entry for you, Denis. Thanks for you help.
err, htmlentity.c it was, not htmltokenizer.h. I'm sorry for confusion.
Hi Denis, can you look at bug #567697, I just tried with code before your change here and it works as expected. After the change, it doesn't. It seems you're not handling non-entities correctly. Thanks.
Created attachment 126982 [details] [review] process entyty error solution Solution for html with bug's - if exist error in entity save all entity in result. But only temporary solution for plain text - replace all entity to his value. Example: < -> < < -> < &aaa; -> &aaa;
I moved the patch to the right bug. (bug #567697) and committed it. Thanks Denis.