GNOME Bugzilla – Bug 54166
glib should have non-locale-sensitive islower/isupper/tolower/toupper/strcasecmp/strncasecmp
Last modified: 2011-02-18 15:47:37 UTC
It's almost never useful to use the locale-specific versions of these functions. But the non-locale-specific ones that handle only ASCII "A-Z" and "a-z" are often useful. I've taken a quick look through sources on GNOME CVS and find that nearly all the users of these calls would be better off with a non-locale-specific version.
Created attachment 518 [details] [review] first cut at a patch to do this
Additional things that I could do to the patch: 1) add documentation -- I couldn't find a good example inside glib to copy, or I would have included documentation with the patch. 2) make the char functions inline 3) optimize the char functions in other ways 4) preserve the (in my opinion useless) locale-specificity of g_strcasecmp and g_strncasecmp and come up with new names for the non-locale-specific functions Note that the functions take gchar instead of taking int the way the standard islower/isupper/tolower/toupper functions do. If there's some reason this is a bad choice, I can change it back.
Some action would probably be good here. Thoughts: - any use of islower(), etc. that assumes locale-sensitive is broken, since it won't work in multibyte locales - ergo islower() etc. are only useful if ASCII-only - however having functions with the same name as the C version that are a bit different is strange, though perhaps not unprecedented in GLib - should g_strdown() g_strup() be ASCII-only or UTF-8-only? - is it OK to incompatibly change g_strdown(), etc.? - how about g_ascii_* for names? Just some random thoughts, this needs thinking through. Some use-cases would be helpful, taking into account which are covered by g_unicode_* and which still are not.
Definitely need to be g_ascii_* I think I'd deprecate g_strcasecmp, g_strdown, g_strup instead of changing them. If they don't have any encoding specified, then they need to handle Unicode, but I think changing existing functions to act on UTF-8 is dubious, so I'd rather introduce new g_utf8_strcasecmp, etc. (Jon Trowbridge was looking at this.)
Created attachment 613 [details] [review] patch that uses ascii prefix and deprecates locale-specific versions
Also included in my patch are two places that Owen missed when he added isupper guards to tolower calls (and vice versa).
Looking more closely at it, it seems that instead of using g_ascii_strdown (or g_strdown for that matter), g_date_prepare_to_parse should instead call g_locale_to_utf8 and then g_utf8_strdown. And g_date_fill_parse_tokens should use g_utf8_strdown instead of g_ascii_strdown and use g_utf8_strstr instead of strstr. But we don't have g_utf8_strdown or g_utf8_strstr yet. Maybe Jon Trowbridge wants to work on those. Maybe we need some new bug reports about that part?
Created attachment 615 [details] [review] improved patch with documentation for all new functions
Created attachment 618 [details] [review] patch with fixed documentation for strncasecmp
I wrote incorrect documentation for g_ascii_strncasecmp, parroting what the man pages for strncmp and strncasecmp say. Then I realized that the man page is wrong and instead documented how it really works. After looking at the C standard I'm still not sure whether the count "n" is supposed to limit the number of characters in "s1" compared (as the man page says) or limit the number of characters in either stirng compared. But all the implementations say both, so I made the comment say both. For example, both the man page and standard seem to say that strncmp ("X", "XY", 1) should return -1, but we all know that it returns 0 in real life. Anyway, I fixed the documentation to match reality in the 4th cut at the patch that I attached earlier.
Hmm, two thoughts about making g_ascii_* compatible with Unicode versions: - I think g_ascii_islower(), etc, probably need to take a gunichar. Yes, that reintroduces the typical pitfall of: g_ascii_islower(((guchar)c[i])) But I think that's better than c < 255 ? g_ascii_islower (c) : FALSE; Since asking whether a unicode character is a lowercase ascii is a reasonable operation too. One possibility would be too take int and interpret -128:-1 as being high ascii, but I think that's probably just too gross. - I suspect that g_ascii_strup(), g_ascii_strdown() should dup the string, since corresponding g_utf8_* functions should do so. And, yes: - The date functions indeed need to be changed to covert from locale / unicode normalize and/or casefold rather than g_ascii_strup(), g_ascii_strdown(). I'm not sure that the change to g_ascii_* is an improvement - it is broken either way.
Changing g_ascii_islower to take a gunichar? Seems like a poor idea. While there might be some use for g_unichar_is_ascii_lowercase, I don't imagine there are going to be a significant number of places where we'd need it. But there is ASCII text in various places, stored in gchar and gchar* I don't think we should let the existence of gunichar prevent us from making these useful gchar functions. Making them less convenient for the callers who really need them just so they can be used on gunichar, which is rarely going to be right anyway, seems wrong to me. But I don't think it's worth debating much. Either way, getting this in soon (not at the last possible moment) so we can get rid of the mistaken locale dependencies in the various libraries for GNOME 2 seems the priority. Changing strup/strdown to dup the string? Seems OK to me, but unnecessary. If it was my decision, I would choose to dup the string, but I prefer that kind of interface anyway (as you can tell if you look at eel). I think the change to the ASCII versions is a slight improvement even for the date functions, but it's far less important than the rest of the changes.
Applied, with the strdup'ing change, but not the gunichar change. Sat Jun 30 12:49:26 2001 Owen Taylor <otaylor@redhat.com> Patch from Darin Adler (#54166) * glib/gstrfuncs.[ch]: Add ascii-only, locale-insensitive g_ascii_to[lower/upper], g_ascii_str[down/up], g_ascii_is[upper/lower] and deprecate the locale-affected versions which break for UTF-8, etc. Make g_ascii_strup/strdown duplicating, not in-place for consistency with UTF-8 functions. * glib/gstring.[ch]: Add ascii-only, locale-insensitive g_string_ascii_[down/up], and deprecate the locale-affected versions which break for UTF-8, etc. * glib/gutils.c glib/gwin32.c test/testglib.c: Use the g_ascii_* functions where appropriate.