Bug 350132 – backspacing doesn't work properly for Arabic




Summary:	backspacing doesn't work properly for Arabic


Status:	RESOLVED OBSOLETE

Product:	pango
Classification:	Platform
Component:	general
Version:	unspecified
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	Behdad Esfahbod
QA Contact:	gtk-bugs

URL:
Whiteboard:

Depends on:
Blocks:	Persian

Reported:	2006-08-06 08:50 UTC by Roozbeh Pournader
Modified:	2018-05-22 12:21 UTC

See Also:
GNOME target:	---
GNOME version:	2.15/2.16

Attachments
patch to handle the rest of the trivial cases (1.23 KB, patch) 2006-11-05 17:12 UTC, Roozbeh Pournader	none
fixed a typo with the previous patch (1.23 KB, patch) 2006-11-05 17:22 UTC, Roozbeh Pournader	none
fixes a typo with the previous patch (1.23 KB, patch) 2006-11-05 17:24 UTC, Roozbeh Pournader	none
new patch incorporating suggestions made by Behdad on IRC (2.65 KB, patch) 2006-11-07 11:54 UTC, Roozbeh Pournader	committed
patch for handling indic NFC (1.46 KB, patch) 2009-10-24 10:48 UTC, Pravin Satpute	none

Description Roozbeh Pournader 2006-08-06 08:50:59 UTC

Backspacing doesn't work as expected for the Arabic script. Currently, the functions gtk_entry_backspace and gtk_text_buffer_backspace normalize the string to NFD and then remove the last character from the string. This is not intuitive, because the combining classes of Arabic NSMs (non-spacing marks) are arbitrary, and because the Hamza forms are usually considered a single letter by readers of the languages written in the Arabic script.

The current behavior is especially bad when a Hamza form is involved, or when two NSMs appear on one letter (when usually one of them is a Shadda).

Examples that result in non-intuitive behavior (NFD, intuitive):

0646 064E 0651 (NOON FATHA SHADDA, NOON SHADDA FATHA): Backspace removes Shadda, while natives think of Shadda as appearing before Fatha in this case, and so expect Fatha to be removed.

0627 0653 (ALEF MADDA, ALEF-MADDA): Backspace removes Madda, while natives think of Alef-Madda as a single unit (0622).

064A 0654 (YEH HAMZA, YEH-HAMZA): This is the same as the common letter YEH-HAMZA (0626). After pressing backspace, only the HAMZA is removed and an Arabic Yeh remains, which is unacceptable in languages like Persian and Urdu, which use 0626 but not 064A.

064A 064E 0654 (YEH FATHA HAMZA, YEH-HAMZA FATHA): This may be among the worst case scenarios for Persian. A user first presses the key for Yeh-Hamza and then for Fatha, but when she backspaces, an Arabic Yeh (not used in Persian) remains with a Fatha over it.
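
The four cases above can be reproduced outside GTK+ with Python's unicodedata module. This is only a model of "normalize to NFD, then drop the last code point" as described in this report, not the GTK+ code; the helper names nfd_backspace and hexes are made up for illustration.

```python
import unicodedata

def nfd_backspace(text):
    """Model of the reported behavior: decompose to NFD, drop the last code point."""
    return unicodedata.normalize("NFD", text)[:-1]

def hexes(text):
    """Format a string as space-separated code points, e.g. '0646 064E'."""
    return " ".join(f"{ord(ch):04X}" for ch in text)

# What a user would plausibly type -> what is left after one backspace.
for typed in ["\u0646\u0651\u064E",   # NOON SHADDA FATHA
              "\u0622",               # ALEF-MADDA
              "\u0626",               # YEH-HAMZA
              "\u0626\u064E"]:        # YEH-HAMZA FATHA
    print(hexes(typed), "->", hexes(nfd_backspace(typed)))
```

In each case the remaining text is the non-intuitive result listed above: Shadda is lost while Fatha stays, and the Hamza forms fall back to a bare Alef or Arabic Yeh.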

Comment 1 Owen Taylor 2006-08-06 12:49:30 UTC

Any cases where you want the entire grapheme deleted can be fixed in
Pango, since GTK+ honors the backspace_deletes_character Pango attribute.
If it's false it will delete the entire grapheme.

(This could potentially be done differently for different languages with
a language module, or just by looking at analysis.language in
pango_default_break().)

I don't know how to handle the first case ... the operation needs to be unaffected by normalization, so you can't just say "delete the second mark appearing in the text sequence". It would basically require special casing.
We could add some sort of helper function in Pango, "delete one character from
this grapheme for this PangoLanguage", to at least centralize the operation
in one place.

Comment 2 Behdad Esfahbod 2006-08-06 15:48:21 UTC

What we really want is for the entry to behave as if no normalization was done.  One backspace cancels the effect of one keystroke.  Why are we normalizing to NFD btw?  Isn't NFC the recommended way to encode text?

Comment 3 Roozbeh Pournader 2006-08-11 15:39:29 UTC

(In reply to comment #1)
> Any cases where you want the entire grapheme deleted can be fixed in
> Pango, since GTK+ honors the backspace_deletes_character Pango attribute.
> If it's false it will delete the entire grapheme.

I know about the attribute, but it's not going to solve the problem. Deleting the entire grapheme fixes the second and third issues, but makes the first and fourth issues (and most of the cases I didn't mention) worse.

(In reply to comment #2)
> What we really want is for the entry to behave as if no normalization was 
> done.

Yes, in a way. In other words, we need some intuitive order defined for each script/language/layout, which defines the order of the atomic elements people think their language is composed of.

>  One backspace cancels the effect of one keystroke.

Yes, but this doesn't mean much if the character sequence has come from the outside world with no information about keystroke order.

> Why are we normalizing to
> NFD btw?  Isn't NFC the recommended way to encode text?

Probably to make sure that the minimum number of characters are deleted when backspacing. IIRC from the code, the buffer becomes NFD just for the minimal deletion.

But still, changing this to NFC is going to fix the last three issues and will not cause any new problems for the Arabic script. It will not change the behaviour of grapheme-deleting scripts either. So it will be a step forward for my use cases. But I can't tell much about others, like Indic scripts.

Anyway, I can attach a patch for using NFC instead of NFD when backspacing if you want to apply it.
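
The claim about NFC can be checked with Python's unicodedata module. A rough sketch only: nfc_backspace is a made-up helper that models "compose to NFC, then drop the last code point", not the patch offered here.

```python
import unicodedata

def nfc_backspace(text):
    """Hypothetical variant: compose to NFC, then drop the last code point."""
    return unicodedata.normalize("NFC", text)[:-1]

# The four NFD sequences from the description.
noon_fatha_shadda = "\u0646\u064E\u0651"
alef_madda        = "\u0627\u0653"
yeh_hamza         = "\u064A\u0654"
yeh_fatha_hamza   = "\u064A\u064E\u0654"

for seq in (noon_fatha_shadda, alef_madda, yeh_hamza, yeh_fatha_hamza):
    left = nfc_backspace(seq)
    print([f"{ord(c):04X}" for c in seq], "->", [f"{ord(c):04X}" for c in left])
```

Under this model the last three sequences compose to 0622, 0626 and 0626 064E, so one backspace removes the whole Hamza form or only the Fatha. The first sequence has no precomposed form and still loses its Shadda.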

Comment 4 Behdad Esfahbod 2006-08-11 16:02:20 UTC

> (In reply to comment #2)
> > What we really want is for the entry to behave as if no normalization was 
> > done.
> 
> > Yes, in a way. In other words, we need some intuitive order defined for each
> script/language/layout, which defines the order of atomic elements people think
> their language is composed of.

We can add new pango attributes to achieve this.

> > Why are we normalizing to
> > NFD btw?  Isn't NFC the recommended way to encode text?
> 
> Probably to make sure that the minimum number of characters are deleted when
> backspacing. IIRC from the code, the buffer becomes NFD just for the minimal
> deletion.
> 
> But still, changing this to NFC is going to fix the last three issues and will
> not cause any new problems for the Arabic script. It will not change the
> behaviour of grapheme-deleting scripts either. So it will be a step forward for
> my use cases. But I can't tell much about others, like Indic scripts.
> 
> Anyway, I can attach a patch for using NFC instead of NFD when backspacing if
> you want to apply it.

Go ahead and attach.  We can switch and see what breaks.

Comment 5 Owen Taylor 2006-08-11 16:17:25 UTC

B> What we really want is for the entry to behave as if no normalization was done.
B> One backspace cancels the effect of one keystroke.  

During text input, yes, that's generally the coolest behavior. But in general,
there is no expectation that one character in the text corresponds to one
keystroke ... once we have committed the input method text, the original
keystrokes are gone. And you also have to deal with the case where you
have existing text coming from who-knows-where. An approach that is based
only on the Unicode text 

B> Why are we normalizing to NFD btw?  Isn't NFC the recommended way to encode text?

I don't think there is a recommendation; NFC generally has better compatibility
with older software (including Pango! :-() but, for example, OS X uses NFD
for filenames. In this case:

 - The delete behavior shouldn't depend on the normalization form (you
   don't want deletion to act differently for OS X filenames...)

 - *** Normalizing to NFC before deleting character-by-character is nonsense...
   because the set of precombined forms is arbitrary, historical and at this 
   point fixed. No further combining forms will be added to Unicode ***

So right now Pango offers the choice of two alternatives: deleting character
by character in NFD or deleting entire graphemes. It appears that neither
works quite right here.

R> I know about the attribute, but it's not going to solve the problem. Deleting
R> the entire grapheme fixes the second and third issues, but makes the first and
R> fourth issues (and most of the cases I didn't mention) worse.

What I'm saying is that while we set the attribute script-by-script now, we
could be more detailed for Arabic, and set it only when the user would actually
expect the entire grapheme to be deleted.

Comment 6 Behdad Esfahbod 2006-08-30 21:35:07 UTC

We need to write an Arabic lang engine then.  Should be interesting to have in-tree lang modules.

Comment 7 Behdad Esfahbod 2006-09-18 21:28:14 UTC

Ok, after fixing quite a few bugs in the language engine infrastructure, we are ready to host in-tree lang engines, and I already have a draft Arabic one.

Comment 8 Behdad Esfahbod 2006-09-18 22:12:30 UTC

2006-09-18  Behdad Esfahbod  <behdad@gnome.org>

        Part of Bug 350132 – backspacing doesn't work properly for Arabic

        * configure.in:
        * modules/arabic/Makefile.am:
        * modules/arabic/arabic-lang.c:
        Add a simple Arabic language engine.  Currently it just makes sure
        that backspace_deletes_character is not set on ALEF-MADDA
        combinations.



This solves the second problem listed in the original report.  The third can be solved by adding more combinations to the current code.  Waiting for patches.  Roozbeh?

Comment 9 Roozbeh Pournader 2006-11-05 17:12:40 UTC

Created attachment 76036
patch to handle the rest of the trivial cases

With the attached patch, the following characters are taken care of:
0622;ARABIC LETTER ALEF WITH MADDA ABOVE;0627 0653
0623;ARABIC LETTER ALEF WITH HAMZA ABOVE;0627 0654
0624;ARABIC LETTER WAW WITH HAMZA ABOVE;0648 0654
0625;ARABIC LETTER ALEF WITH HAMZA BELOW;0627 0655
0626;ARABIC LETTER YEH WITH HAMZA ABOVE;064A 0654

There are three more Arabic characters with standard decompositions, but they are explicitly mentioned as being ligatures in UCD's NamesList.txt.
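
The table above can be cross-checked against the Unicode Character Database from Python. This is a sketch of the idea only, not the attached patch; ATOMIC_FORMS and deletes_whole_cluster are hypothetical names, and treating a cluster as atomic when its NFC form is one of the five letters is an assumption of this example.

```python
import unicodedata

# Precomposed letters from the list above, mapped to their decompositions.
ATOMIC_FORMS = {
    "\u0622": "\u0627\u0653",  # ALEF WITH MADDA ABOVE
    "\u0623": "\u0627\u0654",  # ALEF WITH HAMZA ABOVE
    "\u0624": "\u0648\u0654",  # WAW WITH HAMZA ABOVE
    "\u0625": "\u0627\u0655",  # ALEF WITH HAMZA BELOW
    "\u0626": "\u064A\u0654",  # YEH WITH HAMZA ABOVE
}

def deletes_whole_cluster(cluster):
    """Hypothetical predicate: True if the cluster is one of the atomic forms,
    whether it arrives precomposed or decomposed."""
    return unicodedata.normalize("NFC", cluster) in ATOMIC_FORMS

# Cross-check every table entry against the Unicode data shipped with Python.
for composed, decomposed in ATOMIC_FORMS.items():
    assert unicodedata.normalize("NFD", composed) == decomposed
```

A cluster such as 0626 064E (Yeh-Hamza with Fatha) is not atomic under this predicate, which is why it belongs to the harder cases.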

So back to the harder cases...

Comment 10 Roozbeh Pournader 2006-11-05 17:22:43 UTC

Created attachment 76037
fixed a typo with the previous patch

Comment 11 Roozbeh Pournader 2006-11-05 17:24:52 UTC

Created attachment 76038
fixes a typo with the previous patch

Apparently I mistakenly re-attached the older patch the first time.

Comment 12 Behdad Esfahbod 2006-11-06 02:03:01 UTC

Humm, Owen and I decided that for correct backspacing, it's easiest to add new lang-engine API to do exactly that.

Comment 13 Roozbeh Pournader 2006-11-07 11:54:54 UTC

Created attachment 76146
new patch incorporating suggestions made by Behdad on IRC

Please review.

Comment 14 Roozbeh Pournader 2006-11-07 12:41:35 UTC

Trying to document a simple system that should be fine for all users of Arabic script. The following is a list of common NFD forms and what should be done about them. I am not listing the cases where just removing the last character in NFD (present behavior) is fine, those which are fixed with the last patch, or obscure cases:

{FATHATAN..KASRA|SHADDA|SUKUN|SUPERSCRIPT_ALEF} HAMZA_ABOVE: keep the Hamza, delete the other diacritic

{FATHATAN..KASRA} SHADDA: keep the Shadda, delete the other diacritic

{FATHATAN..KASRA} SHADDA HAMZA_ABOVE: keep Shadda and Hamza, delete the other diacritic

SHADDA SUPERSCRIPT_ALEF HAMZA_ABOVE: delete Superscript Alef

KASRA HAMZA_BELOW: keep the Hamza, delete the Kasra

Some cases, which may arise due to typos, are hard to decide. One example is ALEF-MADDA FATHA/ALEF FATHA MADDAH. In these cases, only the order of data entry is important, and when it doesn't exist, any behavior may be fine.
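
The rules listed above can be written as a small table. A minimal sketch under stated assumptions: the cluster is one base letter followed by combining marks, a rule must match the whole mark sequence in NFD order, anything not covered falls back to the present behavior, and the result is recomposed to NFC. RULES and rule_backspace are made-up names, not Pango API.

```python
import unicodedata

VOWELS = {chr(cp) for cp in range(0x064B, 0x0651)}  # FATHATAN..KASRA
SHADDA, SUKUN = "\u0651", "\u0652"
HAMZA_ABOVE, HAMZA_BELOW = "\u0654", "\u0655"
SUPERSCRIPT_ALEF, KASRA = "\u0670", "\u0650"

# (pattern over the cluster's combining marks in NFD order, index to delete)
RULES = [
    ((VOWELS, {SHADDA}, {HAMZA_ABOVE}), 0),
    (({SHADDA}, {SUPERSCRIPT_ALEF}, {HAMZA_ABOVE}), 1),
    ((VOWELS | {SHADDA, SUKUN, SUPERSCRIPT_ALEF}, {HAMZA_ABOVE}), 0),
    ((VOWELS, {SHADDA}), 0),
    (({KASRA}, {HAMZA_BELOW}), 0),
]

def rule_backspace(cluster):
    """Delete one mark from a single base-plus-marks cluster per the rules above."""
    nfd = unicodedata.normalize("NFD", cluster)
    base, marks = nfd[:1], nfd[1:]
    for pattern, victim in RULES:
        if len(marks) == len(pattern) and all(
                m in allowed for m, allowed in zip(marks, pattern)):
            kept = base + marks[:victim] + marks[victim + 1:]
            break
    else:
        # Not covered by the listed rules: drop the last NFD code point.
        kept = nfd[:-1]
    # Assumption of this sketch: hand the result back in NFC.
    return unicodedata.normalize("NFC", kept)
```

For example, rule_backspace("\u0626\u064E") removes the Fatha and returns 0626, and rule_backspace("\u0646\u0651\u064E") removes the Fatha and keeps the Shadda.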

Comment 15 Behdad Esfahbod 2006-11-07 16:50:54 UTC

(In reply to comment #14)
> Trying to document a simple system that should be fine for all users of Arabic
> script. The following is a list of common NFD forms and what should be done
> about them. I am not listing the cases where just removing the last character
> in NFD (present behavior) is fine, those which are fixed with the last patch,
> or obscure cases:
> 
> {FATHATAN..KASRA|SHADDA|SUKUN|SUPERSCRIPT_ALEF} HAMZA_ABOVE: keep the Hamza,
> delete the other diacritic
> 
> {FATHATAN..KASRA} SHADDA: keep the Shadda, delete the other diacritic
> 
> {FATHATAN..KASRA} SHADDA HAMZA_ABOVE: keep Shadda and Hamza, delete the other
> diacritic
> 
> SHADDA SUPERSCRIPT_ALEF HAMZA_ABOVE: delete Superscript Alef
> 
> KASRA HAMZA_BELOW: keep the Hamza, delete the Kasra
> 
> Some cases, which may arise due to typos, are hard to decide. One example is
> ALEF-MADDA FATHA/ALEF FATHA MADDAH. In these cases, only the order of data
> entry is important and when it doesn't exist, any behavior may be fine.

What I had in mind is:

- Remove all FATHATAN..KASRA|SUKUN.  If anything removed, break.
- Remove SUPERSCRIPT_ALEF.  If anything removed, break.
- Remove SHADDA.  If anything removed, break.
- Remove all HAMZA_ABOVE HAMZA_BELOW.  If anything removed, break.
- Remove the entire cluster.
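
The five steps above might look like this in Python. A sketch only: TIERS and tiered_backspace are hypothetical names, and working on the NFD form and returning NFC are assumptions of this example rather than something stated in the comment.

```python
import unicodedata

# One set of marks per step, in the order the steps are listed above.
TIERS = [
    {chr(cp) for cp in range(0x064B, 0x0651)} | {"\u0652"},  # FATHATAN..KASRA, SUKUN
    {"\u0670"},             # SUPERSCRIPT ALEF
    {"\u0651"},             # SHADDA
    {"\u0654", "\u0655"},   # HAMZA ABOVE, HAMZA BELOW
]

def tiered_backspace(cluster):
    """Strip the first tier of marks present in the cluster; if none, drop the cluster."""
    nfd = unicodedata.normalize("NFD", cluster)  # assumption: operate on NFD
    for tier in TIERS:
        kept = "".join(ch for ch in nfd if ch not in tier)
        if kept != nfd:
            # Something was removed at this tier, so stop here.
            return unicodedata.normalize("NFC", kept)
    return ""  # no removable marks left: remove the entire cluster
```

With NOON SHADDA FATHA, successive calls remove the Fatha, then the Shadda, then the Noon. Because this sketch decomposes first, a lone 0626 loses its Hamza at the fourth step and leaves 064A, so the atomic forms from comment 9 would still need separate handling.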

Comment 16 Behdad Esfahbod 2006-11-07 17:07:07 UTC

(In reply to comment #13)
> Created an attachment (id=76146) [edit]
> new patch incorporating suggestions made by Behdad on IRC
> 
> Please review.

Looks good.  Go ahead and commit.  As for Azeri or other languages, you can condition on analysis->language passed in.   It's not working right now (you get NULL), but I can easily fix that.

Comment 17 Roozbeh Pournader 2006-11-08 13:03:02 UTC

(In reply to comment #15)
> What I had in mind is:
> 
> - Remove all FATHATAN..KASRA|SUKUN.  If anything removed, break.
> - Remove SUPERSCRIPT_ALEF.  If anything removed, break.
> - Remove SHADDA.  If anything removed, break.
> - Remove all HAMZA_ABOVE HAMZA_BELOW.  If anything removed, break.
> - Remove the entire cluster.

There are a lot of Arabic combining marks there now, most of them being in combining classes of 220 and 230, which means the order of them is important. We may need to remove some of those (but not all, as combining Hamzas are also in the same classes) first, that's why I'm somehow sticking to the NFD way. If you want to do it the way you are proposing, you need to understand all the others and find how they are used. There are many complicated cases occurring in Koran, like ones with a Fatha, a Superscript Alef, and a Madda on the same base letter (which I guess may also be a Hamza form).

Also, Superscript Alef should very probably be deleted before Fatha, as some Koranic usage has a base letter with both of the marks, and as the Superscript Alef somehow represents an Alef that should have come after the letter, it is considered later by users.

So, for simpler cases your suggested behavior is somehow the same as mine except that some cases are not handled by yours (and some are not by mine) and that you are removing Superscript Alef later rather than earlier.

(In reply to comment #16)
> As for Azeri or other languages, you can
> condition on analysis->language passed in.   It's not working right now (you
> get NULL), but I can easily fix that.

Well, as we are not aware of the details of any Arabic Azerbaijani keyboard layout, we don't really know how they really enter the characters or expect the backspace to work. Although the combinations are considered one letter, they may as well be entered with two keystrokes and expect them to be deleted one-by-one. I just wanted to put the documentation there for the next time we visit with more info.

Comment 18 Pravin Satpute 2009-10-24 10:27:48 UTC

Hi Roozbeh and Behdad,
I was trying to implement the same thing in indic-lang.c; there are some 20 characters there.

While testing Arabic I found the following problem:

If we input U+0623 أ and press backspace, the whole character is deleted, which is correct.

But if we input U+0627 ا and U+0654 ٔ and press backspace, both characters are deleted with a single backspace, which I think is wrong.

Please correct me if I am wrong.

Thanks

Comment 19 Pravin Satpute 2009-10-24 10:48:00 UTC

Created attachment 146164
patch for handling indic NFC

Attaching here just for review, since this is the same kind of bug.

But somehow it's not working for split matras (IS_SPLIT_MATRA_BRAHMI), since

(0995 + 09cb) after NFC becomes (09c7 + 0995 + 09be),
and a single backspace deletes all of (0995 + 09cb) :(
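
The split matra can be inspected with Python's unicodedata module. A quick check using only the standard library; it shows what NFD produces for 0995 09CB in stored (logical) order, so the order quoted above presumably reflects visual order.

```python
import unicodedata

ka_o = "\u0995\u09CB"  # BENGALI LETTER KA + BENGALI VOWEL SIGN O
nfd = unicodedata.normalize("NFD", ka_o)

# Prints the decomposed sequence as code points: 0995 09C7 09BE
print(" ".join(f"{ord(ch):04X}" for ch in nfd))
```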

Comment 20 Behdad Esfahbod 2009-10-26 21:31:24 UTC

(In reply to comment #19)
> Created an attachment (id=146164) [details] [review]
> patch for handling indic NFC
> 
> attaching here just for review, since already same kind of bug 
> 
> but somehow its not working for split matras (IS_SPLIT_MATRA_BRAHMI), since
> 
> (0995 + 09cb ) after NFC it becomes (09c7 + 0995+ 09be)  
> and single backspace key deletes all(0995 + 09cb) :(

GAH, PLEASE FILE A NEW BUG.  THIS BUG IS ABOUT ARABIC ONLY.

Comment 21 Pravin Satpute 2009-10-27 05:05:55 UTC

OK, I will update the patch on the respective bug.
Did you see my comment #18?
Is that the expected behaviour for Arabic?

Comment 22 GNOME Infrastructure Team 2018-05-22 12:21:37 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to GNOME's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/pango/issues/55.