Bug 371388 – Add Thai language engine



Summary:	Add Thai language engine


Status:	RESOLVED FIXED

Product:	pango
Classification:	Platform
Component:	general
Version:	1.15.x
Hardware:	Other Linux

Importance:	Normal enhancement
Target Milestone:	---
Assigned To:	pango-maint
QA Contact:	pango-maint

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2006-11-06 06:14 UTC by Theppitak Karoonboonyanan
Modified:	2006-11-28 17:57 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Patch to add Thai lang engine, and update Thai sample text (12.01 KB, patch) 2006-11-06 06:17 UTC, Theppitak Karoonboonyanan	committed

Description Theppitak Karoonboonyanan 2006-11-06 06:14:02 UTC

Following an e-mail discussion with Behdad, I propose a patch that adds a Thai language engine based on the libthai library [1], so that Thai text is line-wrapped properly and word-wise caret movement is enabled. The code is taken from the pango-libthai project, a sub-project of libthai, and adjusted according to Behdad's suggestions.

The language engine will be built only if libthai is available.

Link:
[1] http://libthai.sourceforge.net

Comment 1 Theppitak Karoonboonyanan 2006-11-06 06:17:28 UTC

Created attachment 76067
Patch to add Thai lang engine, and update Thai sample text

Note that the Thai sample text is also updated slightly, to demonstrate support for non-TIS-620 characters (in this case, double quotes).

Comment 2 Behdad Esfahbod 2006-11-06 23:51:38 UTC

Thanks Thep.  I see you have switched to using th_uni2tis() to convert to TIS.  I still don't understand how that solves the problem that not every Unicode character is convertible to TIS.  Moreover, for chars that are convertible to TIS, like a period (is it?) we are creating word boundaries on both sides.  That doesn't make much sense to me.

I think your previous use of g_iconv or g_convert was fine.  Just repeat that until the input string is exhausted.  Something like:

  start = text;
  while (start < text + len) {
    use g_iconv to convert start to TIS
    let clen be the length of the converted portion of input

    break converted part

    start += clen;
    start = g_utf8_next_char (start); /* skip over unconvertible char */
  }

The opened GIconv struct can be cached in a static variable.
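
The loop suggested above can be sketched in self-contained C as follows. This is only an illustration under stated assumptions: instead of g_iconv() it uses a small hand-written converter (ASCII passes through, and the Thai block maps to TIS-620 by a fixed offset), the input is assumed to be valid UTF-8, and the names utf8_decode, to_tis620, run_func and for_each_tis_run are hypothetical, not part of GLib, Pango or libthai.

```c
#include <assert.h>
#include <stddef.h>

/* Decode one UTF-8 sequence at p (assumed valid and complete).
 * Stores the code point in *cp and returns the sequence length in bytes. */
static size_t utf8_decode(const unsigned char *p, unsigned long *cp)
{
    if (p[0] < 0x80) { *cp = p[0]; return 1; }
    if ((p[0] & 0xE0) == 0xC0) {
        *cp = ((p[0] & 0x1FUL) << 6) | (p[1] & 0x3FUL);
        return 2;
    }
    if ((p[0] & 0xF0) == 0xE0) {
        *cp = ((p[0] & 0x0FUL) << 12) | ((p[1] & 0x3FUL) << 6) | (p[2] & 0x3FUL);
        return 3;
    }
    *cp = ((p[0] & 0x07UL) << 18) | ((p[1] & 0x3FUL) << 12)
        | ((p[2] & 0x3FUL) << 6) | (p[3] & 0x3FUL);
    return 4;
}

/* Map a code point to TIS-620. Returns 1 on success, 0 if not representable. */
static int to_tis620(unsigned long cp, unsigned char *out)
{
    if (cp < 0x80) { *out = (unsigned char) cp; return 1; }
    if ((cp >= 0x0E01 && cp <= 0x0E3A) || (cp >= 0x0E3F && cp <= 0x0E5B)) {
        *out = (unsigned char) (cp - 0x0E00 + 0xA0);
        return 1;
    }
    return 0;
}

/* Hypothetical callback standing in for "break converted part". */
typedef void (*run_func)(const unsigned char *tis, size_t n,
                         size_t byte_offset, void *data);

/* Split text into maximal runs that convert to TIS-620, hand each run to
 * handle_run (if non-NULL), and skip every unconvertible character.
 * buf must hold at least len bytes. Returns the number of runs found. */
static size_t for_each_tis_run(const char *text, size_t len,
                               unsigned char *buf,
                               run_func handle_run, void *data)
{
    const unsigned char *p, *end;
    size_t runs = 0;

    assert(text != NULL && buf != NULL);
    p = (const unsigned char *) text;
    end = p + len;

    while (p < end) {
        const unsigned char *run_start = p;
        unsigned long cp;
        size_t n = 0;

        /* Convert as far as possible. */
        while (p < end) {
            size_t clen = utf8_decode(p, &cp);
            if (!to_tis620(cp, &buf[n]))
                break;
            n++;
            p += clen;
        }
        if (n > 0) {
            if (handle_run != NULL)
                handle_run(buf, n,
                           (size_t) (run_start - (const unsigned char *) text),
                           data);
            runs++;
        }
        /* Skip the unconvertible character, if we stopped on one. */
        if (p < end)
            p += utf8_decode(p, &cp);
    }
    return runs;
}
```

Unlike the pseudocode, this sketch skips a character only when conversion actually stopped before the end of the input.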

Comment 3 Theppitak Karoonboonyanan 2006-11-07 01:52:33 UTC

(In reply to comment #2)

> I see you have switched to using th_uni2tis() to convert to TIS. 
> I still don't understand how that solves the problem that not every Unicode
> character is convertible to TIS.

th_uni2tis() returns a code representing an unknown value for characters it fails to convert. These dummy characters are there just to preserve character positions, and th_brk() treats them as non-Thai characters when determining word boundaries.

>  Moreover, for chars that are convertible to
> TIS, like a period (is it?) we are creating word boundaries on both sides. 
> That doesn't make much sense to me.

This is a limitation of the current th_brk() implementation. I recognize this issue and plan to address it in the next version.

> I think your previous use of g_iconv or g_convert was fine.  Just repeat that
> until the input string is exhausted.  Something like:
> 
>   start = text;
>   while (start < text + len) {
>     use g_iconv to convert start to TIS
>     let clen be the length of the converted portion of input
> 
>     break converted part
> 
>     start += clen;
>     start = g_utf8_next_char (start); /* skip over unconvertible char */
>   }
> 
> The opened GIconv struct can be cached in a static variable.

Well, it's somewhat equivalent. I can use either method, g_iconv() or th_uni2tis(). However, th_uni2tis() is already implemented with a static table lookup. Besides, giving th_brk() more context, rather than feeding it chunk by chunk, can increase the precision of the word analysis, at least in theory.
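
The position-preserving conversion described in this comment can be illustrated with a minimal C sketch. It does not call libthai: uni_to_tis_or_unknown and convert_keep_positions are hypothetical helpers, and the placeholder value TIS_UNKNOWN (0xFF) is an assumption chosen for the example, not necessarily the value th_uni2tis() returns.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical placeholder for characters with no TIS-620 equivalent. */
#define TIS_UNKNOWN 0xFF

/* Table-free stand-in for a Unicode-to-TIS-620 lookup: ASCII passes
 * through, the Thai block maps by a fixed offset, anything else becomes
 * the placeholder. */
static unsigned char uni_to_tis_or_unknown(unsigned long cp)
{
    if (cp < 0x80)
        return (unsigned char) cp;
    if ((cp >= 0x0E01 && cp <= 0x0E3A) || (cp >= 0x0E3F && cp <= 0x0E5B))
        return (unsigned char) (cp - 0x0E00 + 0xA0);
    return TIS_UNKNOWN;
}

/* Convert n code points one-to-one, so that index i in tis[] always
 * corresponds to index i in ucs[]. A word breaker run over tis[] can then
 * report break positions that are valid character offsets in the
 * original text. */
static void convert_keep_positions(const unsigned long *ucs, size_t n,
                                   unsigned char *tis)
{
    size_t i;

    assert(ucs != NULL && tis != NULL);
    for (i = 0; i < n; i++)
        tis[i] = uni_to_tis_or_unknown(ucs[i]);
}
```

Because the output has exactly one byte per input character, the whole string can be passed to the breaker at once, which is the "more context" argument made above. The cost is that every unconvertible character looks the same to the breaker.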

Comment 4 Behdad Esfahbod 2006-11-07 02:20:24 UTC

(In reply to comment #3)
> th_uni2tis() returns a code representing unknown value for failed characters.
> The dummy characters are just there to keep character positions, and th_brk()
> will treat them as non-Thai characters when determining word boundaries.

Fine.  But then it cannot differentiate among non-Thai characters.  For example, the Unicode algorithm doesn't allow a line break after '(' or before ')'.  It's always best to just override what is necessary and leave the rest to Pango's default_break.
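
One way to read "override what is necessary" is sketched below in C. It is an illustration only, not the committed module: is_thai and merge_thai_breaks are hypothetical names, and the flag arrays are simplified stand-ins for break attributes, where flag i means a line break is allowed before character i.

```c
#include <assert.h>
#include <stddef.h>

/* True for code points in the Unicode Thai block. */
static int is_thai(unsigned long cp)
{
    return cp >= 0x0E00 && cp <= 0x0E7F;
}

/* is_line_break[] arrives filled in by a generic (default) algorithm.
 * thai_break[] holds the opinion of a dictionary-based Thai breaker.
 * Only positions between two Thai characters are overridden; every other
 * position, such as after '(' or before ')', keeps its default value. */
static void merge_thai_breaks(const unsigned long *ucs, size_t n,
                              const unsigned char *thai_break,
                              unsigned char *is_line_break)
{
    size_t i;

    assert(ucs != NULL && thai_break != NULL && is_line_break != NULL);
    for (i = 1; i < n; i++) {
        if (is_thai(ucs[i - 1]) && is_thai(ucs[i]))
            is_line_break[i] = thai_break[i];
    }
}
```

With the input "(", two Thai letters, ")" and default flags that forbid every break, only the position between the two Thai letters can be changed by the Thai breaker.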

> Well, it's somewhat equivalent. I can use either method, g_iconv() or
> th_uni2tis(). However, th_uni2tis is already implemented with static table
> lookup. Besides, giving th_brk() more context, rather than chunk by chunk, can
> increase the word analysis precision, at least in theory.

Right, but no context is left when th_uni2tis converts all unconvertible chars to a single code point.  As for the static table, that's true, but not a priority.

Comment 5 Theppitak Karoonboonyanan 2006-11-07 04:43:49 UTC

(In reply to comment #4)
> Fine.  But it cannot differentiate between any non-Thai character then.  For
> example, the Unicode algorithm doesn't allow a line break after '(' or before
> ')'.  It's always best to just override what is necessary and leave the rest to
> Pango's default_break.

By design, th_brk() tries to cover the relevant punctuation marks in US-ASCII, although this is not fully implemented in the current version.

> Right, but no context is left when th_uni2tis converts all unconvertible chars
> to a single code point.  As for the static table, that's true, but not a
> priority.

Even so, the treatment as "unknown char" is still informative. As I said, all US-ASCII characters are considered convertible, so they are still meaningful. The "unknown chars" can still be treated as placeholders in naive grammatical rules, for example.

Comment 6 Behdad Esfahbod 2006-11-27 20:53:55 UTC

Ok, I'm going to commit this as is.  Further improvements can be committed later.

Comment 7 Behdad Esfahbod 2006-11-27 22:03:25 UTC

Thep, if you happen to improve the module, or if you see a need to do so, please file another bug (specifically about the conversion stuff discussed above).  Thanks for your work!

2006-11-27  Behdad Esfahbod  <behdad@gnome.org>

        Bug 371388 – Add Thai language engine
        Patch from Theppitak Karoonboonyanan

        * configure.in: Look for libthai and enable thai-lang module.
        * modules/thai/Makefile.am: Hook thai-lang module.

        * modules/thai/thai-lang.c: New Thai language engine that uses libthai
        to do dictionary-based Thai line-breaking.

        * examples/test-thai.txt: Improved.