GNOME Bugzilla – Bug 613255
Read-only, non-DBus, store access
Last modified: 2010-08-18 14:52:08 UTC
With large datasets, getting a list of the indexed items would be slowed down by the D-Bus connection. Applications that are only interested in small datasets could keep using a single D-Bus connection. The Tracker API should make read and write accesses transparent, so that large, mostly-read accesses to the database are not hampered by the less frequent need to write metadata.
Because SQLite isn't an MVCC database, this isn't as easy as it sounds, although for reading it can probably be done. The problem is that, depending on things like SQLite's own caches, one connection's sqlite3_step can block the sqlite3_step calls of another sqlite3_open connection, especially when large transactions are open (which we keep in tracker-store for the purpose of performance). We are investigating this, but it's not likely that support for it will arrive soon. Also note that a client developer should of course never try to access the database file directly. But we have designed libtracker-data in such a way that you can indeed call tracker_sparql_query_new in your own process too. We did this with the idea in mind that maybe someday we'll make libtracker-data available to client developers who don't want to query over DBus. For writing it'll of course always be problematic if you want to avoid the DBus overhead, but that's not what this bug requests.
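The blocking behaviour described above can be reproduced with SQLite's default rollback-journal locking. A minimal sketch in Python (the stdlib sqlite3 module standing in for the C sqlite3_open()/sqlite3_step() calls; the table name is made up for illustration):

```python
import os
import sqlite3
import tempfile

# File-backed database: in-memory databases are private to one connection.
path = os.path.join(tempfile.mkdtemp(), "meta.db")

# The writer stands in for tracker-store; isolation_level=None means we
# manage transactions ourselves, timeout=0 means "fail instead of waiting".
writer = sqlite3.connect(path, timeout=0, isolation_level=None)
writer.execute("CREATE TABLE resources (id INTEGER PRIMARY KEY, uri TEXT)")

# A big write transaction. BEGIN EXCLUSIVE takes the database lock up front,
# which is what eventually happens anyway once a large deferred transaction
# spills its page cache to the database file.
writer.execute("BEGIN EXCLUSIVE")
writer.execute("INSERT INTO resources (uri) VALUES ('urn:example')")

# A second, read-only connection: the hypothetical direct-access client.
reader = sqlite3.connect(path, timeout=0)
try:
    reader.execute("SELECT count(*) FROM resources").fetchone()
    blocked = False
except sqlite3.OperationalError:  # "database is locked"
    blocked = True

writer.execute("COMMIT")          # once the big transaction commits,
count = reader.execute("SELECT count(*) FROM resources").fetchone()[0]
print(blocked, count)             # the reader gets through
```

While the big transaction holds the lock, the reader's SELECT fails immediately rather than returning uncommitted data; a real client would sit in a busy-wait instead.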
Bastien: What you are asking is basically similar to the access model of dconf, right?
(In reply to comment #2) > Bastien: What you are asking is basically similar to the access model of dconf, > right? I have no idea what DConf's access model is. I'm asking for is faster access to the data in the store when read-only. Going over D-Bus for 100k tracks queried by Rhythmbox, on startup, wouldn't be workable for example.
(In reply to comment #2) > Bastien: What you are asking is basically similar to the access model of dconf, > right?

Note that DConf uses an mmap(), which of course allows concurrent access. With a database like SQLite, which needs to ensure that your SELECT doesn't return results that haven't yet been committed by the open transaction, it's not as easy. So SQLite allows a sort of concurrent read-only access per sqlite3_open() until the writer's transaction buffer fills up. Because we keep very large transactions open in tracker-store (for the aforementioned performance reasons), you'll most likely always block on our big transaction. Waiting for your sqlite3_open() connection to get that lock can take longer than the time you'd save by avoiding DBus. Since the connection the DBus interface uses is the same one the transaction is on (it simply runs the SELECT inside the big transaction, which of course works fine), querying over DBus might actually return sooner. Sharing the sqlite3_open() handle between multiple processes doesn't sound like a realistic workaround for that to me.

So I think that if the marshalling/demarshalling of DBus is really problematic, a custom UNIX socket based IPC would be a better solution. That's not going to be a performance overhead, because such a solution ends up being translated into a few memcpy()s in the kernel (you'll have other bottlenecks to worry about first). DBus is of course also a unix-socket based IPC, but its performance issues are mostly in marshalling and demarshalling afaik.

(In reply to comment #3) > I have no idea what DConf's access model is. > > I'm asking for is faster access to the data in the store when read-only. Going > over D-Bus for 100k tracks queried by Rhythmbox, on startup, wouldn't be > workable for example.

That's why you have LIMIT and OFFSET in SPARQL: you fetch the results in pages. I don't know of many use-cases where a UI like Rhythmbox needs to fetch 100k things.
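The paging idea maps directly onto SPARQL's LIMIT/OFFSET solution modifiers. A sketch in Python, using an in-memory SQLite table as a stand-in for the store (the `tracks` table and `fetch_page` helper are made up for illustration):

```python
import sqlite3

# In-memory stand-in for the store; table and column names are hypothetical.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tracks (id INTEGER PRIMARY KEY, title TEXT)")
db.executemany("INSERT INTO tracks (title) VALUES (?)",
               [(f"track-{i}",) for i in range(250)])

def fetch_page(offset, limit=100):
    """One page of results, mirroring SPARQL's LIMIT/OFFSET modifiers."""
    return db.execute(
        "SELECT title FROM tracks ORDER BY id LIMIT ? OFFSET ?",
        (limit, offset)).fetchall()

# Walk the full result set page by page, as a UI would when scrolling.
pages = []
offset = 0
while True:
    page = fetch_page(offset)
    if not page:
        break
    pages.append(page)
    offset += len(page)

print(len(pages), sum(len(p) for p in pages))  # 250 rows in 3 pages
```

The UI only pays for the pages the user actually scrolls to, which is the crux of the disagreement in the following comments.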
I know what some people might think: sorting! But when an application wants to do the sorting of a huge data model clientside, it's misdesigned. I know many apps insist on this because that fancy GtkTreeModelSort is so nice and easy to use (I hope Rhythmbox doesn't use it, but I don't know), but for large data models that is absolutely wrong, and again wrong. That's called broken by design.
> Note that DConf uses an mmap(), this of course allows concurrent access. With a > database like SQLite that needs to ensure that you don't get results in your > SELECT that aren't committed by the open transaction yet, it's not as easy.

Note that our solution would work if we lowered this requirement, as in our use-case it isn't needed. And this is why the lack of MVCC in SQLite can block your second sqlite3_open()'s sqlite3_step() calls. Anyway, it's not easy; it would require some serious SQLite3 hacks.
Philip has a point and I was about to say the same thing. The user can't process 100k items at once and the screen usually can't show them either. This strains the service and usually indicates a poorly designed solution. We have seen this time and time again in Maemo.
The problem is that a cursor-like API (where you only fetch the 30 rows that are needed when they are needed/displayed) highlights the IPC round-trip problem even more: I don't want every operation in my application to be slowed down because an IPC call has to be made (with the associated context switches). GConf has pretty much taught us that doing a ton of small IPC round-trips is not a workable solution. And that was just for preferences; I don't even want to consider the pain this will cause if we store/access all metadata that way.
GConf's problem was that the ORBit2 service wasn't multithreaded (and was absolutely not coded for multithreading). This meant that when two or more processes accessed the service simultaneously, the servicing of the requests became serialized (one had to wait for the other). GConf's other problem, at some point, was a very slow data store that insisted on using so-called "readable XML files" (that insistence came from a strange ideology people had a few years ago). I recall that this got replaced with an in-memory GMarkup solution, so I'm guessing that GConf's data-access performance problem has been turned into a "it's a bloody memory hog" one. Strange solution... oh well.

Context switching doesn't sound to me like it's necessarily a bottleneck for user interface software that draws its visual content much, much slower than any form of process switching caused by two entangled processes doing IPC. At least not on a modern desktop PC. On mobile we sometimes start worrying about this, but it's rare. For realtime things like sound it's more of a problem. Besides, the X11 server and (in Ruben's case) Banshee also do quite large amounts of IPC just for instructing the X server to draw stuff. So IPC itself (over a typical Unix socket) isn't slow, especially not when you group the requests together (pipeline them), as described recently on the foundation mailing list by Alan.

DConf of course doesn't have GConf's "serializing of client requests" problem for reading clients, because it allows concurrent read access to the shared mmap() by the client processes themselves. But as I mentioned earlier, this isn't as easy as it sounds on a non-MVCC database like SQLite.
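The pipelining point can be sketched with a toy request/reply service over a Unix socketpair (all names here are made up for illustration): many small requests are queued back-to-back before any reply is read, so the per-request cost is a copy into a socket buffer rather than a full round trip with its context switches.

```python
import socket
import threading

# A Unix-domain socket pair: one end is the toy "service", one the client.
server_sock, client_sock = socket.socketpair()

def serve(sock, n):
    """Echo n newline-terminated requests back to the client."""
    rf = sock.makefile("rb")
    wf = sock.makefile("wb")
    for _ in range(n):
        wf.write(rf.readline())
    wf.flush()

N = 1000
threading.Thread(target=serve, args=(server_sock, N), daemon=True).start()

wf = client_sock.makefile("wb")
rf = client_sock.makefile("rb")
for i in range(N):
    wf.write(b"req %d\n" % i)   # queue requests without waiting for replies
wf.flush()                      # the whole batch goes out in a few writes

replies = [rf.readline() for _ in range(N)]
print(len(replies), replies[0])
```

The 1000 requests cross the socket in a handful of buffered writes instead of 1000 blocking round trips; this is the "grouping requests" idea, and it is why raw Unix-socket IPC is cheap compared to per-call marshalling overhead.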
(In reply to comment #7) > The problem is that a cursor-like API (where you only fetch the 30 rows that > are needed when they are needed/displayed) only highlights the IPC-roundtrip > problem even more: I don't want every operation in my application being slowed > down because an IPC call has to be made (with the associated context switches).

Yes, in a "window" model you need more IPC to retrieve all the results, but our argument is that _the first call_ (which includes what the user needs to see on screen) will be much faster. For example, a query with 10000 results:

Option 1 - retrieving everything: query time (e.g. 1 sec) + IPC time (e.g. 5 sec). Total time: 6 sec. Time perceived by the user: 6 sec.

Option 2 - retrieving in windows of 100 elements: query time (e.g. 1 sec) + IPC time (e.g. now only 1 sec). Total time: 2 seconds * 10 = 20 seconds (worst case, when the user scrolls through the whole list). Time perceived by the user: 2 seconds.

The numbers are not real, and the window size can be adjusted. This discussion is very interesting; we could continue it on the mailing list. There is still time to think about the right solution to this (very common) problem, and we need feedback from applications to make sure we propose something sensible.
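The trade-off in the comment can be written out as a small calculation; the figures are the comment's own illustrative numbers (including its factor of 10 for the worst case), not measurements, and the helper function is made up:

```python
def perceived_and_total(query_s, ipc_s, n_batches=1):
    """Time until the first screenful appears vs. time to fetch everything."""
    per_batch = query_s + ipc_s
    return per_batch, per_batch * n_batches

# Option 1: everything in one reply (illustrative numbers from the comment).
first1, total1 = perceived_and_total(1.0, 5.0, n_batches=1)

# Option 2: windowed retrieval; the comment counts 10 batches for the
# worst case of the user scrolling through the whole list.
first2, total2 = perceived_and_total(1.0, 1.0, n_batches=10)

print(first1, total1)   # 6.0 6.0  -> user stares at a blank screen for 6 s
print(first2, total2)   # 2.0 20.0 -> first screenful after 2 s
```

Windowing trades worse worst-case total time for much better perceived latency, which is the whole argument of this comment.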
With direct-access support having been merged into master, I'm closing this bug. Use the libtracker-sparql API and you can enjoy a direct connection to the meta.db SQLite database.