GNOME Bugzilla – Bug 612591
[patch] Please review g_string_ref() and g_string_unref() API additions
Last modified: 2017-11-16 10:38:35 UTC
Created attachment 155872 [details] [review] Initial patch for review In my various projects, I use refcounted string objects, mostly to save memory and also for to avoid over-zealous strdup'ing. Most projects won't need to do this, but when you have 10 copies of 10,000 strings (typically 100 bytes each) then the refcounting functionality becomes really useful. In the attached patch I've added this functionality to GString, copying the logic from the recent GPtrArray _ref and _unref API additions. Please review. Thanks!
Review of attachment 155872 [details] [review]: Since GString is a mutable type, it seems like a bit of a trap to use it for structural string sharing. It's also just not anywhere as efficient as a dedicated data type could be. For example with a dedicated API you can just pack a refcount in front of the string data in one malloc block. (No objections though to the patch - it may be useful for some other cases to have refcounts on strings).
I have such a packed refcounted string in eel: http://git.gnome.org/browse/nautilus/tree/eel/eel-string.h http://git.gnome.org/browse/nautilus/tree/eel/eel-string.c (look for eel_ref_str) It supports both "normal" refcounted strings and "unique strings" which are internalized in a hash table while there is a reference outstanding to it. All threadsafe too.
eel_ref_str is a very simple and useful, so i wouldn't mind if it was in glib actually.
What about extending the quark API to be refcounted?
Thats not ABI compatible. quarks are expected to get the same value next time even if at some point inbetween there was no living string referencing it.
I use the unique ref_strs for things like mimetypes and usernames/realnames in nautilus files. These are massively duplicated, but are freeform strings so any static enum would not work. Also, this makes comparison of such strings fast (just compare pointer bits).
Well, you could make it so that if a GQuark was ever created for a given string, its refcount would go to 0xFFFF which would act as a flag meaning "infinite".
(In reply to comment #3) > eel_ref_str is a very simple and useful, so i wouldn't mind if it was in glib > actually. Yes, that would suit my application too. It looks like a few projects now code their own thing (mine included) and I'm sure they are not all as threadsafe as your code. +1 from me.
We don't need any refcounts on quarks. The are immortal. If your purpose is sharing unique strings, just use g_intern_static_string or g_intern_string. Just make sure to not use the static version from an (un)loadable module. I can still see that it might be useful to make our string buffer (ie GString) refcounted.
Review of attachment 155872 [details] [review]: ::: glib/gstring.c @@ +101,3 @@ + volatile gint ref_count; +}; + Why not just add the ref_count field to GString ? It is not like we are supporting stack-allocated GStrings....
Matthias: I don't use g_intern_string, because thes are not 'internal' strings, but things based on user input, so the set of strings is unbounded and changes over time. It would not be nice if these things were immortal. Instead my unique strings are kept in the unique hash only while there is at least one reference to the string. Also, the refcounted strings are not only for interned strings. The non-interned eel_ref_str:s in nautilus is used for e.g. the File names. Each file object has a few different types of names (display name, file name, edit name). These are all eel_ref_str:s, and typically (except in weird cases) all point to the same ref-string (not unique in general, we just compare when setting these names).
Alex: Ah, I see. That is a valid case. > Well, you could make it so that if a GQuark was ever created for a given > string, its refcount would go to 0xFFFF which would act as a flag meaning > "infinite". I think this might work out. We'd have to keep a separate hash table for refed strings that are not quarks. What would that api look like ? const gchar *g_ref_string (const gchar *string) const gchar *g_unref_string (const gchar *string) The return values would be guaranteed to be == comparable with each other an with interned strings, but while interned strings live forever, these only live as long as there's a reference. Need to be a bit careful to get sequences like this right: a = g_ref_string ("a"); b = g_intern_string ("a"); // probably waste the memory for the refcount here if (a != b) g_print ("fail!\n"); g_unref_string (a); Would be great if somebody could try to implement that.
Created attachment 159455 [details] [review] GRefString, an immutable thread-safe refcounted string type I did a quick port of the eel code to glib to show what i mean.
As per comment 12. That "fail" assert won't hit. I don't actually put non-interned strings in the hashtable, they are just normal refcounted objects. Space in the hashtable is only made for interned strings.
Oh, but I was talking about the exiting g_intern_string api that is quark-based. What I was proposing is to arrange things so that g_ref_string ("a") returns the string from the quark string table if there is one, and that g_intern_string would look for an existing copy in your table of ref'ed strings before duping the string that it got.
Hmm, GRefString is basically a const char *, but its an opaque type that you access via a macro (it just has a gint refcount stored at offset -4). It seems risky to have an api that takes "char *" types that are not "normal" c strings. Lots of chances for confusion and wrong kinds of strings ending up in this api. However, sharing the quark table with the refcounted interned strings table seems like a good idea in general.
(In reply to comment #10) > Why not just add the ref_count field to GString ? > It is not like we are supporting stack-allocated GStrings.... Otherwise ref_count becomes visible (and settable) rather than an internal implementation detail. Richard.
Richard: Its not like we support the app randomly setting the other fields of GString.
Created attachment 159623 [details] [review] new patch with comments addressed from review (In reply to comment #18) > Richard: Its not like we support the app randomly setting the other fields of > GString. I guess, it just seemed odd to put a volatile int into a public header. New patch attached for review.
https://bugzilla.gnome.org/show_bug.cgi?id=622721 contains a patch with a refcounted string that overloads C strings.
It's 6 years later, and now gnome-software needs a refcounted string object. Given we have g_ptr_array_ref() can't we have g_string_ref() as well? Wrapping GString into a GsString means I have to cast everything in an ugly way, add this to the gnome-software plugin API and deal with all the gobject-introspection junk. Having g_string_ref/unref would really make me happy and make our core platform library more consistent. I'm happy to rebase if it helps.
I'm sure we can add refcounting to GString without breaking ABI, by having a: typedef struct { GString base; volatile int ref_count; } GRealString; inside gstring.c and then allocating a GRealString internally. The issue is that GString is *not* a string: it's a string builder. Passing it around only makes sense if you're modifying it, not as a reference counted immutable string. If gnome-software is using GString already because it's modifying string buffers then it's fine; and passing around GString instances, instead of playing ping pong with char* and GString, is liable to cut down memory use and fragmentation.
(In reply to Emmanuele Bassi (:ebassi) from comment #22) > I'm sure we can add refcounting to GString without breaking ABI, by having a: > > typedef struct { > GString base; > > volatile int ref_count; > } GRealString; > > inside gstring.c and then allocating a GRealString internally. So, I did that initially in https://bugzilla.gnome.org/attachment.cgi?id=155872&action=diff unless I'm mistaken. I just copied the public struct members to avoid using a 'base' indirection everywhere which seemed to be what other objects did in GLib. > The issue is that GString is *not* a string: it's a string builder. Passing > it around only makes sense if you're modifying it, not as a reference > counted immutable string. Right, and GPtrArray isn't a immutable array either, it's an array builder. Given we can refcount pretty much every other object in GLib (even explicit builders like GKeyFile) I don't think mutability of a refcounted object is a issue here. > If gnome-software is using GString already because it's modifying string > buffers then it's fine; and passing around GString instances, instead of > playing ping pong with char* and GString, is liable to cut down memory use > and fragmentation. Using a refcounted GsString rather than strdup()ing 35 localised string properties in 1400 objects saves me ~2.5Mb per process and allows gnome-software to start 200ms faster (and likely a lot more on ARM).
(In reply to Richard Hughes from comment #23) > (In reply to Emmanuele Bassi (:ebassi) from comment #22) > > I'm sure we can add refcounting to GString without breaking ABI, by having a: > > > > typedef struct { > > GString base; > > > > volatile int ref_count; > > } GRealString; > > > > inside gstring.c and then allocating a GRealString internally. > > So, I did that initially in > https://bugzilla.gnome.org/attachment.cgi?id=155872&action=diff unless I'm > mistaken. I just copied the public struct members to avoid using a 'base' > indirection everywhere which seemed to be what other objects did in GLib. Yep, that's okay. > > The issue is that GString is *not* a string: it's a string builder. Passing > > it around only makes sense if you're modifying it, not as a reference > > counted immutable string. > > Right, and GPtrArray isn't a immutable array either, it's an array builder. Not really: it's a mutable array. :-) > Given we can refcount pretty much every other object in GLib (even explicit > builders like GKeyFile) I don't think mutability of a refcounted object is a > issue here. It's not. It's mostly that we don't really use GString inside our API because it's a "heavy" object. In general, our strings are immutable at the point of entry into an API. Additionally, GString is kind of a C-only API. > > If gnome-software is using GString already because it's modifying string > > buffers then it's fine; and passing around GString instances, instead of > > playing ping pong with char* and GString, is liable to cut down memory use > > and fragmentation. > > Using a refcounted GsString rather than strdup()ing 35 localised string > properties in 1400 objects saves me ~2.5Mb per process and allows > gnome-software to start 200ms faster (and likely a lot more on ARM). How many of those strings are really mutable, though? I mean: are you taking those strings and modifying them, or are you just avoiding a copy and then using the string somewhere else, like you would with a `char*`?
(In reply to Emmanuele Bassi (:ebassi) from comment #24) > > Using a refcounted GsString rather than strdup()ing 35 localised string > > properties in 1400 objects saves me ~2.5Mb per process and allows > > gnome-software to start 200ms faster (and likely a lot more on ARM). > > How many of those strings are really mutable, though? I mean: are you taking > those strings and modifying them, or are you just avoiding a copy and then > using the string somewhere else, like you would with a `char*`? A mixture of both really. The majority of strings never get changed (e.g. the localized application name) but some do get modified or "fixed up" for display, for instance the licence field gets hyperlinked, any trailing spaces are removed. The fixing up isn't in the hot path so to speak, so a refcounted char* would certainly get me 99% the way there too. Certainly g_ref_pointer_alloc0() looks like the more general API, although I don't know if it's too general to actually be obvious. From my point of view there seem to be a few obvious ways to do a reference counted string: inline blob at head, i.e. overallocating and putting the refcount in the first few bytes + less fragmentation + faster as only one alloc + fast to ref and unref - you can't "steal" the char memory - you can't just treat it like a general char* without a silly getter inline blob at tail: + less fragmentation + faster + can treat like a normal char* - you can't just steal the char - need to do strlen just to get the refcount - bad things happen if the string length is ever changed, e.g. truncated wrapped blob, e.g. creating a slice and internal malloc like GString + more fragmentation + slower - can't treat like normal char* + looks different in code to normal char* + fast to refcount + possible to add other things to hidden struct + possible to change internal string + possible to steal internal string hashed blob, e.g. using a static hash table to keep track of memory refcounts + looks, walks and talks like a char* + can modify string as long as start address stays the same + fast-ish to refcount as just needs a hashtable lookup - difficult to clear/destroy hashtable - not going to be clear when a function is returning a ref'd "char*" - not going to be possible to do g_autoptr() to automatically unref the pointer refcount - kinda kludgy returning gpointers everywhere, although with g_ref_pointer_take() it's kinda okay I guess listing out all the +ves and -ves kinda points to a more general solution like g_ref_pointer_*() although I don't know if this is something that you guys still want to pursue.
(In reply to Richard Hughes from comment #25) For added fun and games, I've updated the refcounted allocations patches; you can see them in the wip/ebassi/rc branch of GLib. > > How many of those strings are really mutable, though? I mean: are you taking > > those strings and modifying them, or are you just avoiding a copy and then > > using the string somewhere else, like you would with a `char*`? > > A mixture of both really. The majority of strings never get changed (e.g. > the localized application name) but some do get modified or "fixed up" for > display, for instance the licence field gets hyperlinked, any trailing > spaces are removed. The fixing up isn't in the hot path so to speak, so a > refcounted char* would certainly get me 99% the way there too. > > Certainly g_ref_pointer_alloc0() looks like the more general API, although I > don't know if it's too general to actually be obvious. > From my point of view there seem to be a few obvious ways to do a reference > counted string: This is, by the way, how g_ref_* works in the branch above: > inline blob at head, i.e. overallocating and putting the refcount in the > first few bytes > + less fragmentation > + faster as only one alloc > + fast to ref and unref > - you can't "steal" the char memory > - you can't just treat it like a general char* without a silly getter You don't need a getter; if you overallocate and return a pointer to the content then you can treat the allocation as any other char*. We do this all the time in GObject. :-) Stealing refcounted data is always tricky, unless you have weak references to it. For instance: char *str = g_ref_string_steal (orig_str); now what does 'orig_str' points at? The same pointer, with reference count intact? Or do we set `orig_str[0] = '\0'` and assume that every string function will always stop at the first NUL byte? Or do we forcibly drop the reference to 0, and orig_str will now point at invalid memory? > I guess listing out all the +ves and -ves kinda points to a more general > solution like g_ref_pointer_*() although I don't know if this is something > that you guys still want to pursue. I'd still want it to be available. GSK has started adding a bunch of small data refcounted structures, and having to implement reference counting all the time is error prone.
(In reply to Emmanuele Bassi (:ebassi) from comment #26) > You don't need a getter; if you overallocate and return a pointer to the > content then you can treat the allocation as any other char*. We do this all > the time in GObject. :-) Right, makes sense. I guess another negative is you don't know if a random pointer is refcounted or not without documenting it somewhere in an API. I don't know of a way of saying "for this random pointer in an allocated block, give me the start address of this block that we got from malloc" but I'm happy to be proved wrong. > I'd still want it to be available. GSK has started adding a bunch of small > data refcounted structures, and having to implement reference counting all > the time is error prone. Right, if you want to dupe this bug with #622721 that's fine with me, but I think we need to do /something/ this year. :)
Let’s dupe this to bug #622721 then. I’ll add it to the 2.56 milestone with the goal of doing *something* this year. *** This bug has been marked as a duplicate of bug 622721 ***