GNOME Bugzilla – Bug 59388
g_ascii_is_space etc.
Last modified: 2011-02-18 15:47:37 UTC
When implementing some stuff i used isascii(). Owen told me to use g_ascii_is_space(), which doesn't exist. Do we want to add this function? What about the other g_ascii_is_XXX that aren't currently implemented. Some discussion here: http://lists.gnome.org/archives/gtk-devel-list/2001-August/msg00207.html
I think we should go ahead and add all of them. It isn't even obvious to me what ones of these are local dependent and asking programmers with little experience in i18n to guess seems like asking too much. (and also because we have (char c) for the g_ascii_* functions instead of the sign-cast-bug-inviting signature for the standard functions.) The full set of missing functions are: int iscntrl(int); int isdigit(int); int isgraph(int); int isprint(int); int ispunct(int); int isspace(int); int isxdigit(int); I would suggest making these and also the ones we already have macros indexing into a constant array of flag fields, since performance does matter here. But care should be taken with sign problems - you probably want to do something like array[((unsigned int)(c)) & 0xff] to get the same effect as the current isascii (char c).
Alex, let me know if you want me to write the additional functions, or if you'd like to do them yourself.
Created attachment 948 [details] Perl program for producing table of types
I've attached a perl program that may be useful if implementing these functions as macros as indicated above - it produces a 512-byte table that can be used as above for macros without multiple evaluation. I've checked over the table some and believe it is correct, but it might be worth uncommenting the commented section and verifying.
The program above treats the vertical tab character (\x0b) as isspace/isprint, to correspond to the handling for the C locale on Linux. On second thought, I believe this is wrong, since the vertical tab is univerally not used, and people won't expect it to get passed by these functions and won't be able to handle it. Also, \r\n\t\f\v are treated as g_unichar_isspace() but not g_unichar_isprint(), and the script above treats isspace as a proper subset of isprint. I think we need to as a testcase assert that g_ascii_isXXX(c) <=> 0 <= c < 128 && g_unicode_isXXXX(c)
I was thinking that the test code should run through all the cases and check that the results match the "C" locale results from the is* calls, with explicit exceptions for the cases we want to handle differently, and also run the corresponding g_unicode_is* calls.
I wrote the test code already and I'm working on a patch.
I am now testing the code. There was just a misunderstanding about isprint. It's supposed to include the space character, but not other white-space characters. I fixed the script to match the C spec. on that one. I also left the vertical tab character in there. While I agree that it's not so useful to call it a space character, I also don't think it makes sense to change g_ascii_isspace to not include it if g_unichar_isspace and isspace ("C" locale) both include it. If you want me to change it, that's OK. I also had to add 0x7F to iscntrl. Anyway, all the functions now match -- testing g_ascii_isXXX vs. isXXX and g_unichar_isXXX except for one case: g_unichar_is_punct. g_unichar_is_punct returns FALSE for: DOLLAR SIGN PLUS SIGN LESS-THAN SIGN EQUALS SIGN GREATER-THAN SIGN SPACING CIRCUMFLEX SPACING GRAVE VERTICAL BAR TILDE Is this a bug in g_unichar_is_punct?
On further inspection, I think that g_unichar_ispunct should include symbols, not just punctuation.
Created attachment 956 [details] [review] patch to add these calls and fix g_unichar anomaly too
Created attachment 957 [details] [review] previous patch accidentally moved a comment away from the includes section
Looks in general good. I'm a little uncertain about including the checks against the C library in the test suite because the C standard is a little vague about the exact required behavior for the C locale ... I think it _probably_ can be taken as an exact specification, but am a little sure around the edges for such things as whether DEL is a control character or not. I'd actually prefer to take the "what works best for the user" approach rather than the "exactly the same as the standard C functions". In particular, I'll stick to my guns and say that \v should not count as a space. And I think it _is_ worth implementing the is* functions as macros. That's the main reason I suggested the table ... I doubt that as functions, the table significantly improves speed, but it allows writing macros easily. I don't think there are any real downsides to macros except for possibly more confusing error reporting for the user. (The C library functions are already macros, however.) In terms of docs, the right thing to do is to add the prototypes to docs/refernences/glib/glib-overrides.txt so they get documented as function not macros. I think you can actually put the doc comments into the header files along with the macros now (though I haven't tried that in this circumstance), or they can go in tmpl/. I agree with the change to g_unichar_ispunct(), though it isa little odd for it to be returning TRUE for things which aren't punctuation in unicode terms, consistency with what the C library ispunct() means is probably more important. [ who uses ispunct() anyways? In fact, who uses isgraph(), isprint()?.. in fact, most uses of isalnum, etc, are also broken ... you should be using the pango logical atributes functionality to find text boundaries... but anyways... ]
I already "bent to your will" on the '\v' thing, as you'll see in my patch. I changed both g_ascii_isspace and g_unichar_isspace. The C library functions are pairs of macros and matching functions. Making the GLib functions be macros seems overkill to me. You're welcome to do it, but I won't do that part myself. I understand that very-high-efficiency versions of these functions can be help performance, but I guess that sort of thing is not really my cup of tea. I agree that many of the <ctype.h> functions, like ispunct, are pretty useless, and it's almost silly for us to be making the ASCII and Unicode versions of them.
Created attachment 969 [details] [review] Same patch, now using macros
I attached a new patch that implements the functions as macros. I moved the docs to tmpl/ I had to change the table for vertical tab though, now it is !space and cntrl.
I think you need to have parentheses around the c in the macros. Here's how I would have done one: #define g_ascii_isalnum(c) \ ((g_ascii_table[(guchar) (c)] & G_ASCII_ALNUM) != 0)
Checked in.