GNOME Bugzilla – Bug 371388
Add Thai language engine
Last modified: 2006-11-28 17:57:33 UTC
Following an e-mail discussion with Behdad, I propose a patch that adds a Thai language engine based on the libthai library [1], so that Thai text is line-wrapped properly and word-wise caret movement is enabled. The code is taken from the pango-libthai project, a sub-project of libthai, and adjusted according to Behdad's suggestions. The language engine is built only if libthai is available.

Link: [1] http://libthai.sourceforge.net
Created attachment 76067: Patch to add Thai lang engine, and update Thai sample text

Note that the Thai sample text is also updated slightly, to demonstrate support for non-TIS-620 characters (in this case, double quotes).
Thanks Thep.

I see you have switched to using th_uni2tis() to convert to TIS. I still don't understand how that solves the problem that not every Unicode character is convertible to TIS. Moreover, for chars that are convertible to TIS, like a period (is it?) we are creating word boundaries on both sides. That doesn't make much sense to me.

I think your previous use of g_iconv or g_convert was fine. Just repeat that until the input string is exhausted. Something like:

  start = text;
  while (start < text + len) {
    use g_iconv to convert start to TIS
    let clen be the length of the converted portion of input

    break converted part

    start += clen;
    start = g_utf8_next (start); /* skip over unconvertible char */
  }

The opened GIconv struct can be cached in a static variable.
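The chunked loop proposed above can be sketched in self-contained C. This is only an illustration under stated assumptions: instead of g_iconv it uses a small hand-written UTF-8 decoder and a direct Unicode-to-TIS-620 mapping, and the names utf8_decode, to_tis620 and next_tis_chunk are hypothetical, not part of GLib, Pango or libthai. Each call yields the next run of convertible characters and steps over the unconvertible character that ended it.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence from s (at most `avail` bytes).
 * Stores the code point in *cp and returns the number of bytes consumed.
 * Malformed or truncated input is reported as U+FFFD and consumes one byte,
 * so the caller always makes progress. Overlong forms are not rejected. */
size_t utf8_decode(const unsigned char *s, size_t avail, uint32_t *cp)
{
    size_t need, i;

    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { *cp = s[0] & 0x1F; need = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { *cp = s[0] & 0x0F; need = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { *cp = s[0] & 0x07; need = 4; }
    else { *cp = 0xFFFD; return 1; }

    if (need > avail) { *cp = 0xFFFD; return 1; }
    for (i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80) { *cp = 0xFFFD; return 1; }
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    return need;
}

/* Stand-in for the iconv step: returns 1 and stores the TIS-620 byte if cp
 * is US-ASCII or an assigned Thai character, otherwise returns 0. */
int to_tis620(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) { *out = (unsigned char)cp; return 1; }
    if ((cp >= 0x0E01 && cp <= 0x0E3A) || (cp >= 0x0E3F && cp <= 0x0E5B)) {
        *out = (unsigned char)(cp - 0x0E00 + 0xA0);
        return 1;
    }
    return 0;
}

/* Convert the next run of convertible characters starting at *pos.
 * buf must have room for at least len bytes. On return, *utf8_start is the
 * byte offset of the run in text and *pos is past the run and past the
 * single unconvertible character that ended it (if any).
 * Returns the number of TIS-620 bytes written, or 0 when input is exhausted. */
size_t next_tis_chunk(const char *text, size_t len, size_t *pos,
                      unsigned char *buf, size_t *utf8_start)
{
    const unsigned char *p = (const unsigned char *)text;
    uint32_t cp;
    unsigned char b;

    while (*pos < len) {
        size_t n = 0;
        *utf8_start = *pos;
        while (*pos < len) {
            size_t clen = utf8_decode(p + *pos, len - *pos, &cp);
            if (!to_tis620(cp, &b))
                break;
            buf[n++] = b;
            *pos += clen;
        }
        if (*pos < len)  /* skip over the unconvertible char */
            *pos += utf8_decode(p + *pos, len - *pos, &cp);
        if (n > 0)
            return n;    /* caller would run the word breaker on buf[0..n) */
    }
    return 0;
}
```

A caller would loop on next_tis_chunk() until it returns 0, passing each chunk to the word breaker and mapping the resulting positions back through utf8_start.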
(In reply to comment #2)
> I see you have switched to using th_uni2tis() to convert to TIS.
> I still don't understand how that solves the problem that not every Unicode
> character is convertible to TIS.

th_uni2tis() returns a code representing unknown value for failed characters. The dummy characters are just there to keep character positions, and th_brk() will treat them as non-Thai characters when determining word boundaries.

> Moreover, for chars that are convertible to
> TIS, like a period (is it?) we are creating word boundaries on both sides.
> That doesn't make much sense to me.

This is a limitation of the current th_brk() implementation. I recognize this issue and plan to address it in the next version.

> I think your previous use of g_iconv or g_convert was fine. Just repeat that
> until the input string is exhausted. Something like:
>
>   start = text;
>   while (start < text + len) {
>     use g_iconv to convert start to TIS
>     let clen be the length of the converted portion of input
>
>     break converted part
>
>     start += clen;
>     start = g_utf8_next (start); /* skip over unconvertible char */
>   }
>
> The opened GIconv struct can be cached in a static variable.

Well, it's somewhat equivalent. I can use either method, g_iconv() or th_uni2tis(). However, th_uni2tis is already implemented with static table lookup. Besides, giving th_brk() more context, rather than chunk by chunk, can increase the word analysis precision, at least in theory.
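For comparison, the position-preserving conversion described in this reply might look roughly like the sketch below. It is an assumption-laden illustration, not libthai's code: the function name ucs4_to_tis_keep_positions and the placeholder value TIS_UNKNOWN are hypothetical, and the mapping is written out directly rather than using th_uni2tis(). The point it shows is that output index i always corresponds to input character i, so break positions found in the converted buffer map straight back to the original text.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical placeholder for "no TIS-620 equivalent"; chosen for this
 * sketch only, not taken from libthai. */
#define TIS_UNKNOWN 0xFF

/* Convert n Unicode code points to TIS-620, one output byte per input
 * character. Unconvertible characters become TIS_UNKNOWN so that
 * out[i] always corresponds to in[i].
 * Returns the number of characters that had to be replaced. */
size_t ucs4_to_tis_keep_positions(const uint32_t *in, size_t n,
                                  unsigned char *out)
{
    size_t i, unknown = 0;

    for (i = 0; i < n; i++) {
        uint32_t cp = in[i];

        if (cp < 0x80) {
            /* US-ASCII is shared by TIS-620 */
            out[i] = (unsigned char)cp;
        } else if ((cp >= 0x0E01 && cp <= 0x0E3A) ||
                   (cp >= 0x0E3F && cp <= 0x0E5B)) {
            /* assigned Thai characters map by a fixed offset */
            out[i] = (unsigned char)(cp - 0x0E00 + 0xA0);
        } else {
            /* keep the slot, lose the identity of the character */
            out[i] = TIS_UNKNOWN;
            unknown++;
        }
    }
    return unknown;
}
```

The trade-off discussed in the thread is visible in the last branch: the whole string stays in one buffer, which gives the word breaker full context, but every unconvertible character collapses to the same placeholder.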
(In reply to comment #3)
> (In reply to comment #2)
>
> > I see you have switched to using th_uni2tis() to convert to TIS.
> > I still don't understand how that solves the problem that not every Unicode
> > character is convertible to TIS.
>
> th_uni2tis() returns a code representing unknown value for failed characters.
> The dummy characters are just there to keep character positions, and th_brk()
> will treat them as non-Thai characters when determining word boundaries.

Fine. But it cannot differentiate between any non-Thai character then. For example, the Unicode algorithm doesn't allow a line break after '(' or before ')'. It's always best to just override what is necessary and leave the rest to Pango's default_break.

> > I think your previous use of g_iconv or g_convert was fine. Just repeat that
> > until the input string is exhausted. Something like:
> >
> >   start = text;
> >   while (start < text + len) {
> >     use g_iconv to convert start to TIS
> >     let clen be the length of the converted portion of input
> >
> >     break converted part
> >
> >     start += clen;
> >     start = g_utf8_next (start); /* skip over unconvertible char */
> >   }
> >
> > The opened GIconv struct can be cached in a static variable.
>
> Well, it's somewhat equivalent. I can use either method, g_iconv() or
> th_uni2tis(). However, th_uni2tis is already implemented with static table
> lookup. Besides, giving th_brk() more context, rather than chunk by chunk, can
> increase the word analysis precision, at least in theory.

Right, but no context is left when th_uni2tis converts all unconvertible chars to a single code point. As for the static table, that's true, but not a priority.
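The "override only what is necessary" idea can be illustrated with a small sketch. Everything here is an assumption made for the example: the flag arrays, the function names is_thai and merge_thai_breaks, and the rule that a position counts as "inside Thai text" when the characters on both sides of it are in the Thai block. It is not Pango's actual attribute API. Defaults computed by a generic algorithm are kept everywhere except between two Thai characters, where the dictionary-based result wins.

```c
#include <stddef.h>
#include <stdint.h>

/* True for code points in the Unicode Thai block (U+0E00..U+0E7F). */
int is_thai(uint32_t cp)
{
    return cp >= 0x0E00 && cp <= 0x0E7F;
}

/* line_break[i] != 0 means "a line break is allowed before character i".
 * On entry it holds the defaults from a generic breaking algorithm.
 * thai_word_start[i] is the (hypothetical) output of a Thai word breaker
 * for the same positions; its values outside Thai runs are ignored.
 *
 * Only positions with a Thai character on both sides are overridden, so
 * default decisions around punctuation such as '(' and ')' survive. */
void merge_thai_breaks(const uint32_t *text, size_t n,
                       const unsigned char *thai_word_start,
                       unsigned char *line_break)
{
    size_t i;

    for (i = 1; i < n; i++) {
        if (is_thai(text[i - 1]) && is_thai(text[i]))
            line_break[i] = thai_word_start[i];
    }
}
```

With this split, a word breaker that knows nothing about non-Thai characters cannot introduce a break after '(' or before ')', because those positions are never touched.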
(In reply to comment #4)
> (In reply to comment #3)
> > (In reply to comment #2)
> >
> > > I see you have switched to using th_uni2tis() to convert to TIS.
> > > I still don't understand how that solves the problem that not every Unicode
> > > character is convertible to TIS.
> >
> > th_uni2tis() returns a code representing unknown value for failed characters.
> > The dummy characters are just there to keep character positions, and th_brk()
> > will treat them as non-Thai characters when determining word boundaries.
>
> Fine. But it cannot differentiate between any non-Thai character then. For
> example, the Unicode algorithm doesn't allow a line break after '(' or before
> ')'. It's always best to just override what is necessary and leave the rest to
> Pango's default_break.

By design, it tries to cover the relevant punctuation marks included in US-ASCII, although that is not yet fully implemented in the current version.

> > > I think your previous use of g_iconv or g_convert was fine. Just repeat that
> > > until the input string is exhausted. Something like:
> > >
> > >   start = text;
> > >   while (start < text + len) {
> > >     use g_iconv to convert start to TIS
> > >     let clen be the length of the converted portion of input
> > >
> > >     break converted part
> > >
> > >     start += clen;
> > >     start = g_utf8_next (start); /* skip over unconvertible char */
> > >   }
> > >
> > > The opened GIconv struct can be cached in a static variable.
> >
> > Well, it's somewhat equivalent. I can use either method, g_iconv() or
> > th_uni2tis(). However, th_uni2tis is already implemented with static table
> > lookup. Besides, giving th_brk() more context, rather than chunk by chunk, can
> > increase the word analysis precision, at least in theory.
>
> Right, but no context is left when th_uni2tis converts all unconvertible chars
> to a single code point. As for the static table, that's true, but not a
> priority.

Even so, the treatment as "unknown char" is still informative.
As I said, all US-ASCII characters are considered convertible, so they are still meaningful. The "unknown chars" can still be treated as placeholders in naive grammatical rules, for example.
Ok, I'm going to commit this as is. Further improvements can be committed later.
Thep, if you happen to improve the module, or if you see a need to do so, please file another bug (specifically about the conversion stuff discussed above). Thanks for your work!

2006-11-27  Behdad Esfahbod  <behdad@gnome.org>

        Bug 371388 – Add Thai language engine
        Patch from Theppitak Karoonboonyanan

        * configure.in: Look for libthai and enable thai-lang module.

        * modules/thai/Makefile.am: Hook thai-lang module.

        * modules/thai/thai-lang.c: New Thai language engine that uses
        libthai to do dictionary-based Thai line-breaking.

        * examples/test-thai.txt: Improved.