GNOME Bugzilla – Bug 350132
backspacing doesn't work properly for Arabic
Last modified: 2018-05-22 12:21:37 UTC
Backspacing doesn't work as expected for the Arabic script. Currently, the functions gtk_entry_backspace and gtk_text_buffer_backspace normalize the string to NFD and then remove the last character from the string. This is not intuitive, because the canonical combining classes of Arabic NSMs (non-spacing marks) are arbitrary, and because the Hamza forms are usually considered a single letter by readers of the languages written in the Arabic script.

The current behavior is especially bad when a Hamza form is involved, or when two NSMs appear on one letter (usually one of them is a Shadda).

Examples that result in non-intuitive behavior, each listed as code points in NFD followed by (NFD order, intuitive order):

- 0646 064E 0651 (NOON FATHA SHADDA, NOON SHADDA FATHA): Backspace removes the Shadda, while native users think of the Shadda as appearing before the Fatha in this case, so they expect the Fatha to be removed.
- 0627 0653 (ALEF MADDA, ALEF-MADDA): Backspace removes the Madda, while native users think of Alef-Madda as a single unit (0622).
- 064A 0654 (YEH HAMZA, YEH-HAMZA): This is the same as the common letter YEH-HAMZA (0626). After pressing backspace, only the Hamza is removed and an Arabic Yeh remains, which is unacceptable in languages like Persian and Urdu, which use 0626 but not 064A.
- 064A 064E 0654 (YEH FATHA HAMZA, YEH-HAMZA FATHA): This may be among the worst-case scenarios for Persian. A user first presses the key for Yeh-Hamza and then the key for Fatha, but when she backspaces, an Arabic Yeh (not used in Persian) remains with a Fatha over it.
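The four cases above can be reproduced outside GTK+. The following is a minimal Python sketch, not the GTK+ code itself: it assumes that "normalize to NFD, then drop the last code point" is a fair model of the behavior described in the report. The names nfd_backspace and show are hypothetical helpers introduced only for this illustration.

```python
import unicodedata


def nfd_backspace(text):
    """Model of the reported behavior: normalize to NFD, drop the last code point."""
    return unicodedata.normalize("NFD", text)[:-1]


def show(text):
    """Render a string as space-separated hexadecimal code points."""
    return " ".join(f"{ord(ch):04X}" for ch in text)


# Input in the order a user would type it (the "intuitive" order in the report).
typed = {
    "NOON SHADDA FATHA": "\u0646\u0651\u064E",
    "ALEF-MADDA": "\u0622",
    "YEH-HAMZA": "\u0626",
    "YEH-HAMZA FATHA": "\u0626\u064E",
}

for name, text in typed.items():
    nfd = unicodedata.normalize("NFD", text)
    print(f"{name}: NFD = {show(nfd)}; after backspace = {show(nfd_backspace(text))}")
```

In the first case the Shadda is removed although the Fatha was typed last; in the last case the Hamza is removed, leaving 064A 064E.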
Any cases where you want the entire grapheme deleted can be fixed in Pango, since GTK+ honors the backspace_deletes_character Pango attribute. If it's false it will delete the entire grapheme. (This could potentially be done differently for different languages with a language module, or just by looking at analysis.language in pango_default_break().)

I don't know how to handle the first case ... the operation needs to be unaffected by normalization, so you can't just say "delete the second mark appearing in the text sequence". It would basically require special casing. We could add some sort of helper function in Pango, "delete one character from this grapheme for this PangoLanguage", to at least centralize the operation in one place.
What we really want is for the entry to behave as if no normalization was done. One backspace cancels the effect of one keystroke. Why are we normalizing to NFD btw? Isn't NFC the recommended way to encode text?
(In reply to comment #1)
> Any cases where you want the entire grapheme deleted can be fixed in
> Pango, since GTK+ honors the backspace_deletes_character Pango attribute.
> If it's false it will delete the entire grapheme.

I know about the attribute, but it's not going to solve the problem. Deleting the entire grapheme fixes the second and third issues, but makes the first and fourth issues (and most of the cases I didn't mention) worse.

(In reply to comment #2)
> What we really want is for the entry to behave as if no normalization was
> done.

Yes, in a way. In other words, we need some intuitive order defined for each script/language/layout, which defines the order of atomic elements people think their language is composed of.

> One backspace cancels the effect of one keystroke.

Yes, but this doesn't mean much if the character sequence has come from the outside world with no information about keystroke order.

> Why are we normalizing to
> NFD btw? Isn't NFC the recommended way to encode text?

Probably to make sure that the minimum number of characters are deleted when backspacing. IIRC from the code, the buffer becomes NFD just for the minimal deletion.

But still, changing this to NFC is going to fix the last three issues and will not cause any new problems for the Arabic script. It will not change the behaviour of grapheme-deleting scripts either. So it will be a step forward for my use cases. But I can't tell much about others, like Indic scripts.

Anyway, I can attach a patch for using NFC instead of NFD when backspacing if you want to apply it.
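To make the NFC suggestion concrete, here is a small Python sketch under the same simplifying assumption as the report (normalize, then drop the last code point). It is only an illustration of the claim that NFC would help with the last three cases but not the first; nfc_backspace is a hypothetical name and this is not the proposed GTK+ patch.

```python
import unicodedata


def nfc_backspace(text):
    """Hypothetical variant: normalize to NFC, then drop the last code point."""
    return unicodedata.normalize("NFC", text)[:-1]


def show(text):
    """Render a string as space-separated hexadecimal code points."""
    return " ".join(f"{ord(ch):04X}" for ch in text) or "(empty)"


# The four sequences from the original report, given in NFD.
cases = [
    "\u0646\u064E\u0651",  # NOON FATHA SHADDA
    "\u0627\u0653",        # ALEF MADDA
    "\u064A\u0654",        # YEH HAMZA
    "\u064A\u064E\u0654",  # YEH FATHA HAMZA
]

for text in cases:
    print(f"{show(text)} -> {show(nfc_backspace(text))}")
```

With NFC the Hamza forms compose into single code points, so they are deleted or kept as a unit. The NOON case is unchanged, because no precomposed form exists and canonical ordering still places the Shadda last.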
> (In reply to comment #2)
> > What we really want is for the entry to behave as if no normalization was
> > done.
>
> Yes, in a way. In other words, we need some intuitive order defined for each
> script/language/layout, which defines the order of atomic elements people think
> their language is composed of.

We can add new Pango attributes to achieve this.

> > Why are we normalizing to
> > NFD btw? Isn't NFC the recommended way to encode text?
>
> Probably to make sure that the minimum number of characters are deleted when
> backspacing. IIRC from the code, the buffer becomes NFD just for the minimal
> deletion.
>
> But still, changing this to NFC is going to fix the last three issues and will
> not cause any new problems for the Arabic script. It will not change the
> behaviour of grapheme-deleting scripts either. So it will be a step forward for
> my use cases. But I can't tell much about others, like Indic scripts.
>
> Anyway, I can attach a patch for using NFC instead of NFD when backspacing if
> you want to apply it.

Go ahead and attach. We can switch and see what breaks.
B> What we really want is for the entry to behave as if no normalization was done.
B> One backspace cancels the effect of one keystroke.

During text input, yes, that's generally the coolest behavior. But in general, there is no expectation that one character in the text corresponds to one keystroke ... once we have committed the input method text, the original keystrokes are gone. And you also have to deal with the case where you have existing text coming from who-knows-where. An approach that is based only on the Unicode text

B> Why are we normalizing to NFD btw? Isn't NFC the recommended way to encode text?

I don't think there is a recommendation; NFC generally has better compatibility with older software (including Pango! :-() but, for example, OS X uses NFD for filenames. In this case:

- The delete behavior shouldn't depend on the normalization form (you don't want deletion to act differently for OS X filenames...)
- *** Normalizing to NFC before deleting character-by-character is nonsense... because the set of precombined forms is arbitrary, historical and at this point fixed. No further combining forms will be added to Unicode ***

So right now Pango offers the choice of two alternatives: deleting character by character in NFD or deleting entire graphemes. It appears that neither works quite right here.

R> I know about the attribute, but it's not going to solve the problem. Deleting
R> the entire grapheme fixes the second and third issues, but makes the first and
R> fourth issues (and most of the cases I didn't mention) worse.

What I'm saying is that while we set the attribute script-by-script now, we could be more detailed for Arabic, and set it only when the user would actually expect the entire grapheme to be deleted.
We need to write an Arabic lang engine then. Should be interesting to have in-tree lang modules.
OK, after fixing quite a few bugs in the language engine infrastructure, we are ready to host in-tree lang engines, and I already have a draft Arabic one.
2006-09-18  Behdad Esfahbod  <behdad@gnome.org>

        Part of Bug 350132 – backspacing doesn't work properly for Arabic

        * configure.in:
        * modules/arabic/Makefile.am:
        * modules/arabic/arabic-lang.c: Add a simple Arabic language engine.
        Currently it just makes sure that backspace_deletes_character is not
        set on ALEF-MADDA combinations.

This solves the second problem listed in the original report. The third can be solved by adding more combinations to the current code. Waiting for patches. Roozbeh?
Created attachment 76036
patch to handle the rest of the trivial cases

With the attached patch, the following characters are taken care of:

0622;ARABIC LETTER ALEF WITH MADDA ABOVE;0627 0653
0623;ARABIC LETTER ALEF WITH HAMZA ABOVE;0627 0654
0624;ARABIC LETTER WAW WITH HAMZA ABOVE;0648 0654
0625;ARABIC LETTER ALEF WITH HAMZA BELOW;0627 0655
0626;ARABIC LETTER YEH WITH HAMZA ABOVE;064A 0654

There are three more Arabic characters with standard decompositions, but they are explicitly mentioned as being ligatures in UCD's NamesList.txt.

So back to the harder cases...
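The list of precomposed Arabic letters and their decompositions can be cross-checked against the Unicode Character Database. The Python sketch below is only an illustration using the standard unicodedata module; it assumes the range 0600..06FF is the block of interest, and canonical_decompositions is a hypothetical helper name. It prints every character in that range that has a canonical (untagged) decomposition, in the same semicolon-separated layout as the list above.

```python
import unicodedata


def canonical_decompositions(first, last):
    """Map code point (hex string) -> canonical decomposition for a code point range."""
    result = {}
    for cp in range(first, last + 1):
        decomp = unicodedata.decomposition(chr(cp))
        # Compatibility decompositions carry a tag such as "<compat>"; skip those.
        if decomp and not decomp.startswith("<"):
            result[f"{cp:04X}"] = decomp
    return result


arabic = canonical_decompositions(0x0600, 0x06FF)
for cp, decomp in arabic.items():
    print(f"{cp};{unicodedata.name(chr(int(cp, 16)))};{decomp}")
```

The output depends on the Unicode version bundled with the Python interpreter, so it is best treated as a check rather than as a fixed table.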
Created attachment 76037
fixed a typo with the previous patch
Created attachment 76038
fixes a typo with the previous patch

Apparently I mistakenly re-attached the older patch the first time.
Humm, Owen and I decided that for correct backspacing, it's easiest to add new lang-engine API to do exactly that.
Created attachment 76146
new patch incorporating suggestions made by Behdad on IRC

Please review.
Trying to document a simple system that should be fine for all users of Arabic script. The following is a list of common NFD forms and what should be done about them. I am not listing the cases where just removing the last character in NFD (present behavior) is fine, those which are fixed with the last patch, or obscure cases:

- {FATHATAN..KASRA|SHADDA|SUKUN|SUPERSCRIPT_ALEF} HAMZA_ABOVE: keep the Hamza, delete the other diacritic
- {FATHATAN..KASRA} SHADDA: keep the Shadda, delete the other diacritic
- {FATHATAN..KASRA} SHADDA HAMZA_ABOVE: keep Shadda and Hamza, delete the other diacritic
- SHADDA SUPERSCRIPT_ALEF HAMZA_ABOVE: delete Superscript Alef
- KASRA HAMZA_BELOW: keep the Hamza, delete the Kasra

Some cases, which may arise due to typos, are hard to decide. One example is ALEF-MADDA FATHA/ALEF FATHA MADDAH. In these cases, only the order of data entry is important, and when that information doesn't exist, any behavior may be fine.
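One possible reading of the rule list above, as a Python sketch. This is an illustration only, not the attached patch: it assumes each rule matches the complete sequence of combining marks after the base letter in NFD, and that anything not covered falls back to removing the last NFD code point. RULES and backspace_by_rules are hypothetical names.

```python
import unicodedata

KASRA, SHADDA, SUKUN = "\u0650", "\u0651", "\u0652"
HAMZA_ABOVE, HAMZA_BELOW, SUPERSCRIPT_ALEF = "\u0654", "\u0655", "\u0670"
VOWELS = {chr(cp) for cp in range(0x064B, 0x0651)}  # FATHATAN..KASRA

# Each rule: (one set of allowed marks per position in NFD order, index to delete).
RULES = [
    ((VOWELS | {SHADDA, SUKUN, SUPERSCRIPT_ALEF}, {HAMZA_ABOVE}), 0),
    ((VOWELS, {SHADDA}), 0),
    ((VOWELS, {SHADDA}, {HAMZA_ABOVE}), 0),
    (({SHADDA}, {SUPERSCRIPT_ALEF}, {HAMZA_ABOVE}), 1),
    (({KASRA}, {HAMZA_BELOW}), 0),
]


def backspace_by_rules(text):
    """Delete one mark from the last cluster according to RULES; result is in NFD."""
    nfd = unicodedata.normalize("NFD", text)
    # Find where the trailing run of combining marks starts.
    start = len(nfd)
    while start > 0 and unicodedata.combining(nfd[start - 1]) != 0:
        start -= 1
    marks = nfd[start:]
    for pattern, drop in RULES:
        if len(marks) == len(pattern) and all(m in allowed for m, allowed in zip(marks, pattern)):
            return nfd[:start] + marks[:drop] + marks[drop + 1:]
    # Assumed fallback: the present behavior (remove the last NFD code point).
    return nfd[:-1]


# NOON SHADDA FATHA: the Fatha goes first and the Shadda is kept.
print(" ".join(f"{ord(ch):04X}" for ch in backspace_by_rules("\u0646\u0651\u064E")))
```

Because the result stays in NFD, a caller that wants precomposed letters back (for example 0626 after removing a Fatha from YEH-HAMZA FATHA) would need to normalize to NFC afterwards.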
(In reply to comment #14)
> Trying to document a simple system that should be fine for all users of Arabic
> script. The following is a list of common NFD forms and what should be done
> about them. I am not listing the cases where just removing the last character
> in NFD (present behavior) is fine, those which are fixed with the last patch,
> or obscure cases:
>
> {FATHATAN..KASRA|SHADDA|SUKUN|SUPERSCRIPT_ALEF} HAMZA_ABOVE: keep the Hamza,
> delete the other diacritic
>
> {FATHATAN..KASRA} SHADDA: keep the Shadda, delete the other diacritic
>
> {FATHATAN..KASRA} SHADDA HAMZA_ABOVE: keep Shadda and Hamza, delete the other
> diacritic
>
> SHADDA SUPERSCRIPT_ALEF HAMZA_ABOVE: delete Superscript Alef
>
> KASRA HAMZA_BELOW: keep the Hamza, delete the Kasra
>
> Some cases, which may arise due to typos, are hard to decide. One example is
> ALEF-MADDA FATHA/ALEF FATHA MADDAH. In these cases, only the order of data
> entry is important and when it doesn't exist, any behavior may be fine.

What I had in mind is:

- Remove all FATHATAN..KASRA|SUKUN. If anything removed, break.
- Remove SUPERSCRIPT_ALEF. If anything removed, break.
- Remove SHADDA. If anything removed, break.
- Remove all HAMZA_ABOVE HAMZA_BELOW. If anything removed, break.
- Remove the entire cluster.
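The five-step procedure above can be sketched in a few lines of Python. This is a literal, illustrative reading of the list, not Pango code: it assumes the input is a single cluster (one base letter plus its marks), that the cluster is decomposed to NFD first, and that the remaining text is returned in NFD. STEPS and backspace_by_priority are hypothetical names.

```python
import unicodedata

# Mark classes in the order they should be removed, one class per backspace.
STEPS = [
    {chr(cp) for cp in range(0x064B, 0x0651)} | {"\u0652"},  # FATHATAN..KASRA, SUKUN
    {"\u0670"},                                               # SUPERSCRIPT ALEF
    {"\u0651"},                                               # SHADDA
    {"\u0654", "\u0655"},                                     # HAMZA ABOVE, HAMZA BELOW
]


def backspace_by_priority(cluster):
    """Remove the highest-priority class of marks present; else remove the cluster."""
    nfd = unicodedata.normalize("NFD", cluster)
    for marks in STEPS:
        kept = "".join(ch for ch in nfd if ch not in marks)
        if kept != nfd:  # something was removed, so stop here ("break")
            return kept
    return ""  # nothing matched: remove the entire cluster


# Repeated backspaces on NOON FATHA SHADDA peel off Fatha, then Shadda, then the letter.
text = "\u0646\u064E\u0651"
while text:
    text = backspace_by_priority(text)
    print(" ".join(f"{ord(ch):04X}" for ch in text) or "(empty)")
```

Marks that are not named in the list, such as MADDAH ABOVE (0653), are not removed individually under this reading; the whole cluster goes in the last step.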
(In reply to comment #13)
> Created an attachment (id=76146)
> new patch incorporating suggestions made by Behdad on IRC
>
> Please review.

Looks good. Go ahead and commit. As for Azeri or other languages, you can condition on analysis->language passed in. It's not working right now (you get NULL), but I can easily fix that.
(In reply to comment #15)
> What I had in mind is:
>
> - Remove all FATHATAN..KASRA|SUKUN. If anything removed, break.
> - Remove SUPERSCRIPT_ALEF. If anything removed, break.
> - Remove SHADDA. If anything removed, break.
> - Remove all HAMZA_ABOVE HAMZA_BELOW. If anything removed, break.
> - Remove the entire cluster.

There are a lot of Arabic combining marks there now, most of them in combining classes 220 and 230, which means their order is important. We may need to remove some of those first (but not all, as the combining Hamzas are also in the same classes); that's why I'm more or less sticking to the NFD way.

If you want to do it the way you are proposing, you need to understand all the others and find out how they are used. There are many complicated cases occurring in the Koran, like ones with a Fatha, a Superscript Alef, and a Madda on the same base letter (which I guess may also be a Hamza form).

Also, Superscript Alef should very probably be deleted before Fatha, as some Koranic usage has a base letter with both of the marks, and as the Superscript Alef in a way represents an Alef that should have come after the letter, users consider it to come later.

So, for simpler cases your suggested behavior is more or less the same as mine, except that some cases are not handled by yours (and some are not by mine) and that you are removing Superscript Alef later rather than earlier.

(In reply to comment #16)
> As for Azeri or other languages, you can
> condition on analysis->language passed in. It's not working right now (you
> get NULL), but I can easily fix that.

Well, as we are not aware of the details of any Arabic Azerbaijani keyboard layout, we don't really know how users enter the characters or how they expect the backspace to work. Although the combinations are considered one letter, they may just as well be entered with two keystrokes, and users may expect them to be deleted one by one. I just wanted to put the documentation there for the next time we visit this with more info.
Hi Roozbeh and Behdad,

I was trying to implement the same thing in indic-lang.c, where there are twenty-something such characters.

While testing with Arabic I found the following problem: if we input U+0623 أ and press backspace, the whole letter is deleted, which is correct. But if we input U+0627 ا and U+0654 ٔ and press backspace, both characters are also deleted by a single backspace, which I think is wrong.

Please correct me if I am wrong. Thanks.
Created attachment 146164
patch for handling indic NFC

Attaching here just for review, since this is the same kind of bug.

Somehow it is not working for split matras (IS_SPLIT_MATRA_BRAHMI): after NFC, (0995 + 09cb) becomes (09c7 + 0995 + 09be), and a single backspace deletes all of (0995 + 09cb) :(
(In reply to comment #19)
> Created an attachment (id=146164)
> patch for handling indic NFC
>
> attaching here just for review, since already same kind of bug
>
> but somehow its not working for split matras (IS_SPLIT_MATRA_BRAHMI), since
>
> (0995 + 09cb ) after NFC it becomes (09c7 + 0995+ 09be)
> and single backspace key deletes all(0995 + 09cb) :(

GAH, PLEASE FILE A NEW BUG. THIS BUG IS ABOUT ARABIC ONLY.
OK, I will update the patch on the respective bug. Did you see my comment #18? Is that the expected behaviour for Arabic?
-- GitLab Migration Automatic Message -- This bug has been migrated to GNOME's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/pango/issues/55.