Bug 371388 - Add Thai language engine
Status: RESOLVED FIXED
Product: pango
Classification: Platform
Component: general
Version: 1.15.x
Hardware: Other Linux
Importance: Normal enhancement
Target Milestone: ---
Assigned To: pango-maint
QA Contact: pango-maint
Depends on:
Blocks:
Reported: 2006-11-06 06:14 UTC by Theppitak Karoonboonyanan
Modified: 2006-11-28 17:57 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
Patch to add Thai lang engine, and update Thai sample text (12.01 KB, patch)
2006-11-06 06:17 UTC, Theppitak Karoonboonyanan
committed

Description Theppitak Karoonboonyanan 2006-11-06 06:14:02 UTC
Following an e-mail discussion with Behdad, I propose a patch to add a Thai language engine based on the libthai library [1], so that Thai text is properly line-wrapped and word-wise caret movement is enabled. The code is taken from the pango-libthai project, a sub-project of libthai, and adjusted according to Behdad's suggestions.

The language engine will be built only if libthai is available.

Link:
[1] http://libthai.sourceforge.net
Comment 1 Theppitak Karoonboonyanan 2006-11-06 06:17:28 UTC
Created attachment 76067
Patch to add Thai lang engine, and update Thai sample text

Note that Thai sample text is also updated a little bit, to demonstrate non-TIS-620 character support (in this case, double quotes).
Comment 2 Behdad Esfahbod 2006-11-06 23:51:38 UTC
Thanks Thep.  I see you have switched to using th_uni2tis() to convert to TIS.  I still don't understand how that solves the problem that not every Unicode character is convertable to TIS.  Moreover, for chars that are convertable to TIS, like a period (is it?) we are creating word boundaries on both sides.  That doesn't make much sense to me.

I think your previous use of g_iconv or g_convert was fine.  Just repeat that until the input string is exhausted.  Something like:

  start = text;
  while (start < text + len) {
    use g_iconv to convert start to TIS
    let clen be the length of the converted portion of input

    break converted part

    start += clen;
    start = g_utf8_next_char (start); /* skip over unconvertable char */
  }

The opened GIconv struct can be cached in a static variable.
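
A rough sketch of the loop above in plain C, for illustration only. To stay self-contained it does not call g_iconv() or libthai; sketch_utf8_decode(), sketch_is_tis(), sketch_for_each_chunk() and sketch_count() are hypothetical names invented for this example, and the input is assumed to be valid UTF-8. Each maximal run of characters that have a TIS-620 equivalent is handed to a callback (where the real engine would convert and break it), and a single unconvertable character is skipped between runs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence at p (input assumed valid); store the code
 * point in *wc and return the sequence length in bytes. */
static size_t
sketch_utf8_decode (const unsigned char *p, uint32_t *wc)
{
  if (p[0] < 0x80) { *wc = p[0]; return 1; }
  if ((p[0] & 0xE0) == 0xC0) {
    *wc = ((uint32_t) (p[0] & 0x1F) << 6) | (p[1] & 0x3F);
    return 2;
  }
  if ((p[0] & 0xF0) == 0xE0) {
    *wc = ((uint32_t) (p[0] & 0x0F) << 12)
        | ((uint32_t) (p[1] & 0x3F) << 6) | (p[2] & 0x3F);
    return 3;
  }
  *wc = ((uint32_t) (p[0] & 0x07) << 18) | ((uint32_t) (p[1] & 0x3F) << 12)
      | ((uint32_t) (p[2] & 0x3F) << 6) | (p[3] & 0x3F);
  return 4;
}

/* True if the code point has a TIS-620 equivalent: US-ASCII, or the
 * Thai block ranges U+0E01..U+0E3A and U+0E3F..U+0E5B. */
static int
sketch_is_tis (uint32_t wc)
{
  return wc < 0x80
      || (wc >= 0x0E01 && wc <= 0x0E3A)
      || (wc >= 0x0E3F && wc <= 0x0E5B);
}

typedef void (*sketch_chunk_fn) (const char *chunk, size_t len, void *data);

/* Call fn once per maximal convertible run in text[0..len). */
static void
sketch_for_each_chunk (const char *text, size_t len,
                       sketch_chunk_fn fn, void *data)
{
  const unsigned char *start = (const unsigned char *) text;
  const unsigned char *end = start + len;

  assert (text != NULL && fn != NULL);
  while (start < end) {
    const unsigned char *p = start;
    uint32_t wc = 0;
    size_t n = 0;

    /* Extend the chunk while characters are convertible. */
    while (p < end) {
      n = sketch_utf8_decode (p, &wc);
      if (!sketch_is_tis (wc))
        break;
      p += n;
    }

    if (p > start)                  /* "break converted part" goes here */
      fn ((const char *) start, (size_t) (p - start), data);

    if (p < end)                    /* skip over the unconvertable char */
      p += n;
    start = p;
  }
}

/* Example callback: count the chunks and the bytes they cover. */
struct sketch_stats { int chunks; size_t bytes; };

static void
sketch_count (const char *chunk, size_t len, void *data)
{
  struct sketch_stats *st = data;
  (void) chunk;
  st->chunks++;
  st->bytes += len;
}
```

With this shape, the breaker only ever sees text it can represent, and everything between chunks keeps whatever attributes the default algorithm assigned.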
Comment 3 Theppitak Karoonboonyanan 2006-11-07 01:52:33 UTC
(In reply to comment #2)

> I see you have switched to using th_uni2tis() to convert to TIS. 
> I still don't understand how that solves the problem that not every Unicode
> character is convertable to TIS.

th_uni2tis() returns a code representing an unknown value for characters it cannot convert. These dummy characters are there just to keep character positions, and th_brk() will treat them as non-Thai characters when determining word boundaries.
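
To illustrate the position-preserving idea, here is a minimal sketch in plain C. It does not use libthai: sketch_uni2tis() is a hypothetical stand-in for th_uni2tis(), SKETCH_TIS_UNKNOWN is a placeholder value chosen for this example (libthai defines its own error code), and sketch_convert_all() is likewise invented here. The point is only that output index i always corresponds to input character i.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder for code points with no TIS-620 equivalent.
 * Hypothetical constant for this sketch only. */
#define SKETCH_TIS_UNKNOWN 0xFFu

/* Hypothetical stand-in for th_uni2tis(): map one Unicode code point to
 * a TIS-620 byte, or to SKETCH_TIS_UNKNOWN if there is no mapping. */
static unsigned char
sketch_uni2tis (uint32_t wc)
{
  if (wc < 0x80)
    return (unsigned char) wc;                    /* US-ASCII maps to itself */
  if ((wc >= 0x0E01 && wc <= 0x0E3A) || (wc >= 0x0E3F && wc <= 0x0E5B))
    return (unsigned char) (wc - 0x0E00 + 0xA0);  /* Thai block */
  return SKETCH_TIS_UNKNOWN;
}

/* Convert n code points. out[i] always corresponds to in[i], so break
 * positions found in the TIS buffer can be used directly as character
 * offsets into the original text. */
static void
sketch_convert_all (const uint32_t *in, size_t n, unsigned char *out)
{
  assert (in != NULL && out != NULL);
  for (size_t i = 0; i < n; i++)
    out[i] = sketch_uni2tis (in[i]);
}
```

For example, a curly double quote (U+201C) between Thai letters becomes a single placeholder byte, so the letters after it keep their original offsets.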

>  Moreover, for chars that are convertable to
> TIS, like a period (is it?) we are creating word boundaries on both sides. 
> That doesn't make much sense to me.

This is a limitation of the current th_brk() implementation. I recognize this issue and plan to address it in the next version.

> I think your previous use of g_iconv or g_convert was fine.  Just repeat that
> until the input string is exhausted.  Something like:
> 
>   start = text;
>   while (start < text + len) {
>     use g_iconv to convert start to TIS
>     let clen be the length of the converted portion of input
> 
>     break converted part
> 
>     start += clen;
>     start = g_utf8_next_char (start); /* skip over unconvertable char */
>   }
> 
> The opened GIconv struct can be cached in a static variable.

Well, it's somewhat equivalent. I can use either method, g_iconv() or th_uni2tis(). However, th_uni2tis is already implemented with static table lookup. Besides, giving th_brk() more context, rather than chunk by chunk, can increase the word analysis precision, at least in theory.
Comment 4 Behdad Esfahbod 2006-11-07 02:20:24 UTC
(In reply to comment #3)
> (In reply to comment #2)
> 
> > I see you have switched to using th_uni2tis() to convert to TIS. 
> > I still don't understand how that solves the problem that not every Unicode
> > character is convertable to TIS.
> 
> th_uni2tis() returns a code representing unknown value for failed characters.
> The dummy characters are just there to keep character positions, and th_brk()
> will treat them as non-Thai characters when determining word boundaries.

Fine.  But it cannot differentiate between any non-Thai character then.  For example, the Unicode algorithm doesn't allow a line break after '(' or before ')'.  It's always best to just override what is necessary and leave the rest to Pango's default_break.
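
A small sketch of what "override only what is necessary" could look like, in plain C. SketchAttr is a hypothetical, stripped-down stand-in for Pango's per-character attribute record, and sketch_apply_thai_breaks() is a name invented for this example. It assumes the default pass has already filled attrs[] and that brk_pos[] holds break offsets relative to the start of one Thai run.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified per-character attribute record; the real
 * PangoLogAttr carries more fields than this. */
typedef struct {
  unsigned is_line_break : 1;   /* a line may be broken before this char */
} SketchAttr;

/* Add break opportunities inside one Thai run. Offsets in brk_pos[] are
 * relative to run_start. Attributes outside the listed positions are
 * left exactly as the default pass set them. */
static void
sketch_apply_thai_breaks (SketchAttr *attrs, size_t n_attrs,
                          size_t run_start,
                          const int *brk_pos, size_t n_brk)
{
  assert (attrs != NULL && (brk_pos != NULL || n_brk == 0));
  for (size_t i = 0; i < n_brk; i++) {
    if (brk_pos[i] < 0)
      continue;                           /* ignore invalid offsets */
    size_t idx = run_start + (size_t) brk_pos[i];
    if (idx < n_attrs)
      attrs[idx].is_line_break = 1;       /* override only this position */
  }
}
```

Because nothing outside the Thai run is written, rules such as "no break after '('" stay under the control of the default algorithm.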

> > I think your previous use of g_iconv or g_convert was fine.  Just repeat that
> > until the input string is exhausted.  Something like:
> > 
> >   start = text;
> >   while (start < text + len) {
> >     use g_iconv to convert start to TIS
> >     let clen be the length of the converted portion of input
> > 
> >     break converted part
> > 
> >     start += clen;
> >     start = g_utf8_next_char (start); /* skip over unconvertable char */
> >   }
> > 
> > The opened GIconv struct can be cached in a static variable.
> 
> Well, it's somewhat equivalent. I can use either method, g_iconv() or
> th_uni2tis(). However, th_uni2tis is already implemented with static table
> lookup. Besides, giving th_brk() more context, rather than chunk by chunk, can
> increase the word analysis precision, at least in theory.

Right, but no context is left when th_uni2tis converts all unconvertable chars to a single code point.  As for the static table, that's true, but not a priority.
Comment 5 Theppitak Karoonboonyanan 2006-11-07 04:43:49 UTC
(In reply to comment #4)
> (In reply to comment #3)
> > (In reply to comment #2)
> > 
> > > I see you have switched to using th_uni2tis() to convert to TIS. 
> > > I still don't understand how that solves the problem that not every Unicode
> > > character is convertable to TIS.
> > 
> > th_uni2tis() returns a code representing unknown value for failed characters.
> > The dummy characters are just there to keep character positions, and th_brk()
> > will treat them as non-Thai characters when determining word boundaries.
> 
> Fine.  But it cannot differentiate between any non-Thai character then.  For
> example, the Unicode algorithm doesn't allow a line break after '(' or before
> ')'.  It's always best to just override what is necessary and leave the rest to
> Pango's default_break.

By design, it tries to cover the relevant punctuation marks in US-ASCII, although this is not yet fully implemented in the current version.

> > > I think your previous use of g_iconv or g_convert was fine.  Just repeat that
> > > until the input string is exhausted.  Something like:
> > > 
> > >   start = text;
> > >   while (start < text + len) {
> > >     use g_iconv to convert start to TIS
> > >     let clen be the length of the converted portion of input
> > > 
> > >     break converted part
> > > 
> > >     start += clen;
> > >     start = g_utf8_next_char (start); /* skip over unconvertable char */
> > >   }
> > > 
> > > The opened GIconv struct can be cached in a static variable.
> > 
> > Well, it's somewhat equivalent. I can use either method, g_iconv() or
> > th_uni2tis(). However, th_uni2tis is already implemented with static table
> > lookup. Besides, giving th_brk() more context, rather than chunk by chunk, can
> > increase the word analysis precision, at least in theory.
> 
> Right, but no context is left when th_uni2tis converts all unconvertable chars
> to a single code point.  As for the static table, that's true, but not a
> priority.

Even so, the treatment as "unknown char" is still informative. As I said, all US-ASCII characters are considered convertible. So, they are still meaningful. For those "unknown chars", they can still be treated like placeholders in naive grammatical rules, for example.
Comment 6 Behdad Esfahbod 2006-11-27 20:53:55 UTC
Ok, I'm going to commit this as is.  Further improvements can be committed later.
Comment 7 Behdad Esfahbod 2006-11-27 22:03:25 UTC
Thep, if you happen to improve the module, or if you see a need to do so, please file another bug (specifically about the conversion stuff discussed above).  Thanks for your work!

2006-11-27  Behdad Esfahbod  <behdad@gnome.org>

        Bug 371388 – Add Thai language engine
        Patch from Theppitak Karoonboonyanan

        * configure.in: Look for libthai and enable thai-lang module.
        * modules/thai/Makefile.am: Hook thai-lang module.

        * modules/thai/thai-lang.c: New Thai language engine that uses libthai
        to do dictionary-based Thai line-breaking.

        * examples/test-thai.txt: Improved.