After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 735406 - Entire file content logged on "Unable to insert multiple values for subject ..." errors
Entire file content logged on "Unable to insert multiple values for subject ....
Status: RESOLVED FIXED
Product: tracker
Classification: Core
Component: General
unspecified
Other Linux
: Normal normal
: ---
Assigned To: tracker-general
tracker-general
: 733809 741137 743028 747198 (view as bug list)
Depends on:
Blocks: 748045 750309
 
 
Reported: 2014-08-25 19:13 UTC by Elad Alfassa
Modified: 2015-07-12 19:30 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
libtracker-miner: Restrict the amount of data that is logged for errors (1.27 KB, patch)
2014-12-02 16:10 UTC, Debarshi Ray
none Details | Review
first ~4MB of /usr/libexec/tracker-extract -v 3 > tracker-extract-log.txt 2>&1 (1006.40 KB, application/x-compressed-tar)
2014-12-18 11:20 UTC, zachw
  Details
first file causes failed insert (287.84 KB, application/pdf)
2014-12-18 11:21 UTC, zachw
  Details
tracker extract output after deleting tracker's stored data (1.16 MB, text/plain)
2015-01-11 01:53 UTC, zachw
  Details
libtracker-miner: Restrict the amount of data that is logged for errors (1.87 KB, patch)
2015-01-13 14:02 UTC, Debarshi Ray
committed Details | Review

Description Elad Alfassa 2014-08-25 19:13:17 UTC
Tracker is dumping the entire text of the discworld series (which I have as PDFs) into my system log, together with metadata of every song I have in my music directory. The logging verbosity is set to "errors", but tracker ignores that.

apparently it's failing to insert things to the db, so I get things like
Aug 25 22:08:49 rincewind gnome-session[3446]: (tracker-extract:3903): Tracker-WARNING **: Task 0, error: Unable to insert multiple values for subject `urn:uuid:c87bf209-f8c1-14dc-be7e-ff98073c0169' and single v
Aug 25 22:08:49 rincewind gnome-session[3446]: (tracker-extract:3903): Tracker-WARNING **: Sparql update was:
Aug 25 22:08:49 rincewind gnome-session[3446]: INSERT {
Aug 25 22:08:49 rincewind gnome-session[3446]: GRAPH <urn:uuid:472ed0cc-40ff-4e37-9c0c-062d78656540> {


followed by shitton of stuff that shouldn't be in the logs.
Comment 1 Martyn Russell 2014-08-26 08:59:30 UTC
(In reply to comment #0)
> Tracker is dumping the entire text of the discworld series (which I have as
> PDFs) into my system log, together with metadata of every song I have in my
> music directory. The logging verbosity is set to "errors", but tracker ignores
> that.
> 
> apparently it's failing to insert things to the db, so I get things like
> Aug 25 22:08:49 rincewind gnome-session[3446]: (tracker-extract:3903):
> Tracker-WARNING **: Task 0, error: Unable to insert multiple values for subject
> `urn:uuid:c87bf209-f8c1-14dc-be7e-ff98073c0169' and single v
> Aug 25 22:08:49 rincewind gnome-session[3446]: (tracker-extract:3903):
> Tracker-WARNING **: Sparql update was:
> Aug 25 22:08:49 rincewind gnome-session[3446]: INSERT {
> Aug 25 22:08:49 rincewind gnome-session[3446]: GRAPH
> <urn:uuid:472ed0cc-40ff-4e37-9c0c-062d78656540> {
> 
> 
> followed by shitton of stuff that shouldn't be in the logs.

Thanks for the bug report Elad.
This can be 2 things:

1. The extractor is using ontology that's out of date or not common and it's broken.

2. Your DB is broken.

Are you able to attach more detailed logs so we know exactly what the issues are, specifically:

a) what extractor you're using
b) what file you're extracting
c) what DB insert is being attempted.

We should at least be able to sanity check then the ontology used by the extractor and if it is valid.

Thanks :)
Comment 2 Martyn Russell 2014-08-26 17:20:38 UTC
Hi, I can reproduce this actually, not sure why it's doing this. I will investigate some more :)
Comment 3 Elad Alfassa 2014-08-31 15:53:54 UTC
Speaking of log spam, tracker is also spamming my log with

Aug 31 18:52:45 rincewind org.freedesktop.Tracker1[1233]: (tracker-store:1681): GLib-CRITICAL **: Source ID 478282 was not found when attempting to remove it
Aug 31 18:52:45 rincewind org.freedesktop.Tracker1[1233]: (tracker-store:1681): GLib-CRITICAL **: Source ID 478285 was not found when attempting to remove it
Aug 31 18:52:45 rincewind org.freedesktop.Tracker1[1233]: (tracker-store:1681): GLib-CRITICAL **: Source ID 478288 was not found when attempting to remove it
Aug 31 18:52:45 rincewind org.freedesktop.Tracker1[1233]: (tracker-store:1681): GLib-CRITICAL **: Source ID 478291 was not found when attempting to remove it
Aug 31 18:52:45 rincewind org.freedesktop.Tracker1[1233]: (tracker-store:1681): GLib-CRITICAL **: Source ID 478294 was not found when attempting to remove it



constantly.
Comment 4 Martyn Russell 2014-09-05 09:16:52 UTC
Hmm, this doesn't seem to happen all the time and it looks like it is reproducible more for removable media or non-local file systems.

Anyone else experiencing this for these conditions too?
Comment 5 Branko Grubic (bitlord) 2014-09-18 06:08:27 UTC
It happened here on f21 alpha_rc1, only local files indexed '~/'
fresh install, fresh configs
tracker-1.1.4-2.fc21.x86_64
Comment 6 Branko Grubic (bitlord) 2014-09-18 06:34:44 UTC
A lot of useless data gets into journal, I'm not sure is this one tracker related, but here are some examples 

Sep 18 07:57:05 localhost.localdomain gnome-session[1678]: nao:hasTag ?tag5 ;
Sep 18 07:57:05 localhost.localdomain gnome-session[1678]: nao:hasTag ?tag6 ;
Sep 18 07:57:05 localhost.localdomain gnome-session[1678]: nao:hasTag ?tag7 ;
Sep 18 07:57:05 localhost.localdomain gnome-session[1678]: nao:hasTag ?tag8 ;
Sep 18 07:57:05 localhost.localdomain gnome-session[1678]: nao:hasTag ?tag9 ;
Sep 18 07:57:05 localhost.localdomain gnome-session[1678]: nao:hasTag ?tag10 ;
Sep 18 07:57:05 localhost.localdomain gnome-session[1678]: nao:hasTag ?tag11 ;
Sep 18 07:57:05 localhost.localdomain gnome-session[1678]: nao:hasTag ?tag12 ;
Sep 18 07:57:05 localhost.localdomain gnome-session[1678]: nao:hasTag ?tag13 ;
....
Sep 18 07:57:05 localhost.localdomain gnome-session[1678]: ?tag4 a nao:Tag ; nao:prefLabel " beauty" .
Sep 18 07:57:05 localhost.localdomain gnome-session[1678]: ?tag5 a nao:Tag ; nao:prefLabel " beautiful" .
Sep 18 07:57:05 localhost.localdomain gnome-session[1678]: ?tag6 a nao:Tag ; nao:prefLabel " bedroom" .
Sep 18 07:57:05 localhost.localdomain gnome-session[1678]: ?tag7 a nao:Tag ; nao:prefLabel " caucasian" .
Sep 18 07:57:05 localhost.localdomain gnome-session[1678]: ?tag8 a nao:Tag ; nao:prefLabel " casual" .
Sep 18 07:57:05 localhost.localdomain gnome-session[1678]: ?tag9 a nao:Tag ; nao:prefLabel " cool" .


I don't know where it gets all of these (maybe in some document downloaded long time ago from the internet, but most of my stuff is related to technology ...)


Sep 18 07:57:05 localhost.localdomain gnome-session[1678]: (tracker-extract:1999): Tracker-WARNING **: Task 1, error: Subject `urn:uuid:<SOME_UUID> is not in domain `nco:Contact' of property `nco:representative'
Sep 18 07:57:05 localhost.localdomain gnome-session[1678]: (tracker-extract:1999): Tracker-WARNING **: Sparql update was:
Sep 18 07:57:05 localhost.localdomain gnome-session[1678]: INSERT {
Sep 18 07:57:05 localhost.localdomain gnome-session[1678]: GRAPH <urn:uuid:<SOME_UUID> {
Sep 18 07:57:05 localhost.localdomain gnome-session[1678]: <urn:contact:http:%2F%2Fwww.fotolia.com> a nco:Contact ;
Sep 18 07:57:05 localhost.localdomain gnome-session[1678]: nco:fullname "http://www.fotolia.com" .
Sep 18 07:57:05 localhost.localdomain gnome-session[1678]: }

Not sure where it got this latest URL, possibly from some image/media metadata? I never saw that webpage before. ( I didn't find it in tracker source (with a quick grep ) like other URLs in https://bugzilla.gnome.org/show_bug.cgi?id=736848 )
Comment 7 Martyn Russell 2014-10-19 13:25:24 UTC
(In reply to comment #6)
> A lot of useless data gets into journal, I'm not sure is this one tracker
> related, but here are some examples 
> 
> Sep 18 07:57:05 localhost.localdomain gnome-session[1678]: nao:hasTag ?tag5 ;
> Sep 18 07:57:05 localhost.localdomain gnome-session[1678]: nao:hasTag ?tag6 ;
> Sep 18 07:57:05 localhost.localdomain gnome-session[1678]: nao:hasTag ?tag7 ;
> Sep 18 07:57:05 localhost.localdomain gnome-session[1678]: nao:hasTag ?tag8 ;
> Sep 18 07:57:05 localhost.localdomain gnome-session[1678]: nao:hasTag ?tag9 ;
> Sep 18 07:57:05 localhost.localdomain gnome-session[1678]: nao:hasTag ?tag10 ;
> Sep 18 07:57:05 localhost.localdomain gnome-session[1678]: nao:hasTag ?tag11 ;
> Sep 18 07:57:05 localhost.localdomain gnome-session[1678]: nao:hasTag ?tag12 ;
> Sep 18 07:57:05 localhost.localdomain gnome-session[1678]: nao:hasTag ?tag13 ;
> ....
> Sep 18 07:57:05 localhost.localdomain gnome-session[1678]: ?tag4 a nao:Tag ;
> nao:prefLabel " beauty" .
> Sep 18 07:57:05 localhost.localdomain gnome-session[1678]: ?tag5 a nao:Tag ;
> nao:prefLabel " beautiful" .
> Sep 18 07:57:05 localhost.localdomain gnome-session[1678]: ?tag6 a nao:Tag ;
> nao:prefLabel " bedroom" .
> Sep 18 07:57:05 localhost.localdomain gnome-session[1678]: ?tag7 a nao:Tag ;
> nao:prefLabel " caucasian" .
> Sep 18 07:57:05 localhost.localdomain gnome-session[1678]: ?tag8 a nao:Tag ;
> nao:prefLabel " casual" .
> Sep 18 07:57:05 localhost.localdomain gnome-session[1678]: ?tag9 a nao:Tag ;
> nao:prefLabel " cool" .

If this is really gnome-session producing this output, it's not Tracker's fault as far as I can see.
 
> I don't know where it gets all of these (maybe in some document downloaded long
> time ago from the internet, but most of my stuff is related to technology ...)
> 
> 
> Sep 18 07:57:05 localhost.localdomain gnome-session[1678]:
> (tracker-extract:1999): Tracker-WARNING **: Task 1, error: Subject
> `urn:uuid:<SOME_UUID> is not in domain `nco:Contact' of property
> `nco:representative'
> Sep 18 07:57:05 localhost.localdomain gnome-session[1678]:
> (tracker-extract:1999): Tracker-WARNING **: Sparql update was:
> Sep 18 07:57:05 localhost.localdomain gnome-session[1678]: INSERT {
> Sep 18 07:57:05 localhost.localdomain gnome-session[1678]: GRAPH
> <urn:uuid:<SOME_UUID> {
> Sep 18 07:57:05 localhost.localdomain gnome-session[1678]:
> <urn:contact:http:%2F%2Fwww.fotolia.com> a nco:Contact ;
> Sep 18 07:57:05 localhost.localdomain gnome-session[1678]: nco:fullname
> "http://www.fotolia.com" .
> Sep 18 07:57:05 localhost.localdomain gnome-session[1678]: }
> 
> Not sure where it got this latest URL, possibly from some image/media metadata?
> I never saw that webpage before. ( I didn't find it in tracker source (with a
> quick grep ) like other URLs in
> https://bugzilla.gnome.org/show_bug.cgi?id=736848 )

Any warnings or errors should be logged. They technically shouldn't be occurring and should be fixed.
Comment #1 is showing signs of a broken extractor using ontology or a database schema that's not valid and should be fixed.

--

I still don't see what there is to fix in Tracker here?
Comment 8 Gene Czarcinski 2014-11-18 16:03:05 UTC
With Fedora 20 running gnome-shell-3.10.4-9 and tracker-0.16.5-1 there is no problem.  I did a fresh install of Fedora 21 TC2 with gnome-shell-3.14.2-1 and tracker-1.2.4-2 there is the problem.

It was a fresh install of the system but my home diretories and all the data under them is the same as it was with Fedora 20.

Some of the comments imply that I might need to start fresh with the home directories and then selectively copy the old data to the new home directories.

Commnets?
Comment 9 Martyn Russell 2014-12-02 10:43:08 UTC
Either way, I am going to reassign the bug to gnome-session. This code is not in Tracker, you can see that using:

$ git grep -i "nao:hasTag ?tag"

...
Comment 10 Debarshi Ray 2014-12-02 14:50:28 UTC
Martyn, I think this is tracker. See this downstream bug https://bugzilla.redhat.com/show_bug.cgi?id=1148570 and the attached log.

Sep 28 18:37:19 odvarok org.freedesktop.Tracker1.Miner.Extract[6516]: (tracker-extract:12297): Tracker-CRITICAL **: tracker_sparql_connection_query: assertion 'self != NULL' failed
Sep 28 18:37:29 odvarok org.freedesktop.Tracker1.Miner.Extract[6516]: (tracker-extract:12297): Tracker-WARNING **: Task 32, error: Unable to insert multiple values for subject `urn:uuid:96753885-c1dd-f2f1-a46b-62d3fbda42ed' and single valued property `dc:creator' (old_value: '101100', new value: 'urn:uuid:f63415df-ff52-10f2-a4c2-9332531c03cc')
Sep 28 18:37:29 odvarok org.freedesktop.Tracker1.Miner.Extract[6516]: (tracker-extract:12297): Tracker-WARNING **: Sparql update was:
Sep 28 18:37:29 odvarok org.freedesktop.Tracker1.Miner.Extract[6516]: INSERT {
Sep 28 18:37:29 odvarok org.freedesktop.Tracker1.Miner.Extract[6516]: GRAPH <urn:uuid:472ed0cc-40ff-4e37-9c0c-062d78656540> {
Sep 28 18:37:29 odvarok org.freedesktop.Tracker1.Miner.Extract[6516]: <urn:uuid:96753885-c1dd-f2f1-a46b-62d3fbda42ed> nie:dataSource <http://www.tracker-project.org/ontologies/tracker#extractor-data-source> .
Sep 28 18:37:29 odvarok org.freedesktop.Tracker1.Miner.Extract[6516]: <urn:uuid:96753885-c1dd-f2f1-a46b-62d3fbda42ed> a nfo:PaginatedTextDocument ;
Sep 28 18:37:29 odvarok org.freedesktop.Tracker1.Miner.Extract[6516]: nie:title "nsa-tao-ant.PDF" ;
Sep 28 18:37:29 odvarok org.freedesktop.Tracker1.Miner.Extract[6516]: nco:creator [ a nco:Contact ;
Sep 28 18:37:29 odvarok org.freedesktop.Tracker1.Miner.Extract[6516]: nco:fullname "J"] ;
Sep 28 18:37:29 odvarok org.freedesktop.Tracker1.Miner.Extract[6516]: nfo:pageCount 3 ;
Sep 28 18:37:29 odvarok org.freedesktop.Tracker1.Miner.Extract[6516]: nie:plainTextContent "Der Spiegel, 29 December 2013 (images from purchased German edition:\nhttp://cryptome.org/2013/12/nsa-tao-ant-pdf.pdf )\nEnglish version (without following images):\nhttp://www.spiegel.de/international/world/the-nsa-uses-powerful-toolbox-in-effort-to-spy-\non-global-networks-a-940969.html Der Spiegel, 29 December 2013 (images from purchased German edition:\nhttp://cryptome.org/2013/12/nsa-tao-ant-pdf.pdf )\nEnglish version (without following images):\nhttp://www.spiegel.de/international/world/catalog-reveals-nsa-has-back-doors-for-\nnumerous-devices-a-940994.html " .
Sep 28 18:37:29 odvarok org.freedesktop.Tracker1.Miner.Extract[6516]: }
Sep 28 18:37:29 odvarok org.freedesktop.Tracker1.Miner.Extract[6516]: }

This is coming from decorator_commit_cb in tracker-decorator.c
Comment 11 Debarshi Ray 2014-12-02 16:10:39 UTC
Created attachment 292003 [details] [review]
libtracker-miner: Restrict the amount of data that is logged for errors

I arbitrarily used a length of 256 because that should be enough to tell us what is going on.
Comment 12 Martyn Russell 2014-12-02 19:30:49 UTC
Hi Debarshi, there are 2 situations here:

1. The "unable to insert" ... log message - that's actually about a broken extractor trying to insert multi-valued information into a single valued property (IIRC). That's either a broken extractor or the ontology changed and the extractor didn't get updated - OR a new local ontology that extractors are not working well with (i.e. some new ontology that's not upstream).

2. The "nao:Tag" ... log messages - that's not in the Tracker code base because I checked the code and it's clear to me it must be gnome-session.

I was referring to #2 here.

**

For #1, I think we should have a separate bug and fix the extractor causing the problems. For that we would need a file uploaded to test with to reproduce.

Regarding the patch, we could implement something like that but these errors should _NEVER_ happen, so if they do, you've got one broken extractor and a lot of files triggering it. Ultimately, the spamming of the logs should be a sign - FIX THE CODE not REMOVE THE LOGS.
Comment 13 Debarshi Ray 2014-12-02 20:31:36 UTC
(In reply to comment #12)

> 2. The "nao:Tag" ... log messages - that's not in the Tracker code base because
> I checked the code and it's clear to me it must be gnome-session.

I really doubt that gnome-session is writing log messages with "nao:Tag" in it. :-)

On systems with the systemd journal (eg., Fedora), gnome-session redirects the stdio and stderr of DBus activated services to the journal. Which is why they show up prefixed with 'gnome-session'.

I think those lines are just part of some query that failed due to a broken extractor. We only have random snippets of the log in this bug, but if we had the entire logs it would have had the error message at the top.

> I was referring to #2 here.
> 
> **

> For #1, I think we should have a separate bug and fix the extractor causing the
> problems. For that we would need a file uploaded to test with to reproduce.

I have asked the guy who filed the downstream report for the files that have been triggering this. Looking at the logs in https://bugzilla.redhat.com/show_bug.cgi?id=1148570 , we can somewhat narrow down the culprit a bit. It is some extractor that deals with nie:plainTextContent and nfo:PaginatedTextDocument. In some cases I see 'dc:format "application/pdf"', so a PDF is involved.

> Regarding the patch, we could implement something like that but these errors
> should _NEVER_ happen, so if they do, you've got one broken extractor and a lot
> of files triggering it. Ultimately, the spamming of the logs should be a sign -
> FIX THE CODE not REMOVE THE LOGS.

Do we really want to send gigantic strings to the logs? eg., having the entire value of nie:plainTextContent doesn't look very useful. We could parse the string and replace 'nie:plainTextContent "blah, blah, blah, blah"' with 'nie:plainTextContent "..."', but I don't know if it is worth the trouble. Particularly when we are already in a error path. I thought that it is safer to keep the error reporting code as simple as possible. After all, we only want the general structure of the failed query, not the entire text of the file.

Files in the wild will keep triggering bugs in the extractor and we will keep fixing them, but in the meanwhile dumping huge strings into the logs just makes tracker look bad.
Comment 14 Martyn Russell 2014-12-03 09:58:28 UTC
(In reply to comment #13)
> (In reply to comment #12)
> 
> > 2. The "nao:Tag" ... log messages - that's not in the Tracker code base because
> > I checked the code and it's clear to me it must be gnome-session.
> 
> I really doubt that gnome-session is writing log messages with "nao:Tag" in it.
> :-)

I wondered too ... but I know it's not Tracker, so it must be another element in the stack using Tracker.
 
> On systems with the systemd journal (eg., Fedora), gnome-session redirects the
> stdio and stderr of DBus activated services to the journal. Which is why they
> show up prefixed with 'gnome-session'.

Yea.
 
> I think those lines are just part of some query that failed due to a broken
> extractor. We only have random snippets of the log in this bug, but if we had

The tagging doesn't look like it's related to the extractor failing though.

> > I was referring to #2 here.
> > 
> > **
> 
> > For #1, I think we should have a separate bug and fix the extractor causing the
> > problems. For that we would need a file uploaded to test with to reproduce.
> 
> I have asked the guy who filed the downstream report for the files that have
> been triggering this. Looking at the logs in

Ah thanks!

> https://bugzilla.redhat.com/show_bug.cgi?id=1148570 , we can somewhat narrow
> down the culprit a bit. It is some extractor that deals with
> nie:plainTextContent and nfo:PaginatedTextDocument. In some cases I see
> 'dc:format "application/pdf"', so a PDF is involved.

Yea.
 
> > Regarding the patch, we could implement something like that but these errors
> > should _NEVER_ happen, so if they do, you've got one broken extractor and a lot
> > of files triggering it. Ultimately, the spamming of the logs should be a sign -
> > FIX THE CODE not REMOVE THE LOGS.
> 
> Do we really want to send gigantic strings to the logs? eg., having the entire
> value of nie:plainTextContent doesn't look very useful. We could parse the
> string and replace 'nie:plainTextContent "blah, blah, blah, blah"' with
> 'nie:plainTextContent "..."', but I don't know if it is worth the trouble.

That depends, earlier you said "...if we had the entire logs...". The most important thing is that we can identify the file and/or the properties causing the problems. If we cut half that content off, there is a good chance we don't know enough to reproduce the problem or fix it.

> Particularly when we are already in a error path. I thought that it is safer to
> keep the error reporting code as simple as possible. After all, we only want
> the general structure of the failed query, not the entire text of the file.

In my mind, we're trying to optimise something that should never happen - and when it does happen we or I want as much information as possible. You have a point about dumping 1Mb of text from a file into the logs though. My main concern is cutting off content when we have a LOT of properties for a file. I am not sure what the limit should be, but I would rather not cut the details that help us solve the problem.
 
> Files in the wild will keep triggering bugs in the extractor and we will keep
> fixing them, but in the meanwhile dumping huge strings into the logs just makes
> tracker look bad.

It does, you're right, but so does not seeing the bug and not supporting important information in files.

:)
Comment 15 Debarshi Ray 2014-12-04 11:00:42 UTC
(In reply to comment #14)
> (In reply to comment #13)
> > (In reply to comment #12)
> > 
> > > 2. The "nao:Tag" ... log messages - that's not in the Tracker code base because
> > > I checked the code and it's clear to me it must be gnome-session.
> > 
> > I really doubt that gnome-session is writing log messages with "nao:Tag" in it.
> > :-)
> 
> I wondered too ... but I know it's not Tracker, so it must be another element
> in the stack using Tracker.

comment 1 has:
Aug 25 22:08:49 rincewind gnome-session[3446]: (tracker-extract:3903):
Tracker-WARNING **: Task 0, error: Unable to insert multiple values for subject
`urn:uuid:c87bf209-f8c1-14dc-be7e-ff98073c0169' and single v
Aug 25 22:08:49 rincewind gnome-session[3446]: (tracker-extract:3903):
Tracker-WARNING **: Sparql update was:
Aug 25 22:08:49 rincewind gnome-session[3446]: INSERT {
Aug 25 22:08:49 rincewind gnome-session[3446]: GRAPH
<urn:uuid:472ed0cc-40ff-4e37-9c0c-062d78656540> {

comment 6 has:
Sep 18 07:57:05 localhost.localdomain gnome-session[1678]:
(tracker-extract:1999): Tracker-WARNING **: Task 1, error: Subject
`urn:uuid:<SOME_UUID> is not in domain `nco:Contact' of property
`nco:representative'
Sep 18 07:57:05 localhost.localdomain gnome-session[1678]:
(tracker-extract:1999): Tracker-WARNING **: Sparql update was:
Sep 18 07:57:05 localhost.localdomain gnome-session[1678]: INSERT {
Sep 18 07:57:05 localhost.localdomain gnome-session[1678]: GRAPH
<urn:uuid:<SOME_UUID> {
Sep 18 07:57:05 localhost.localdomain gnome-session[1678]:
<urn:contact:http:%2F%2Fwww.fotolia.com> a nco:Contact ;
Sep 18 07:57:05 localhost.localdomain gnome-session[1678]: nco:fullname
"http://www.fotolia.com" .
Sep 18 07:57:05 localhost.localdomain gnome-session[1678]: }

As for these, they might even be from the nautilus plugin:
Sep 18 07:57:05 localhost.localdomain gnome-session[1678]: nao:hasTag ?tag5 ;
Sep 18 07:57:05 localhost.localdomain gnome-session[1678]: nao:hasTag ?tag6 ;
Sep 18 07:57:05 localhost.localdomain gnome-session[1678]: nao:hasTag ?tag7 ;
Sep 18 07:57:05 localhost.localdomain gnome-session[1678]: nao:hasTag ?tag8 ;
Sep 18 07:57:05 localhost.localdomain gnome-session[1678]: nao:hasTag ?tag9 ;
Sep 18 07:57:05 localhost.localdomain gnome-session[1678]: nao:hasTag ?tag10 ;
Sep 18 07:57:05 localhost.localdomain gnome-session[1678]: nao:hasTag ?tag11 ;
Sep 18 07:57:05 localhost.localdomain gnome-session[1678]: nao:hasTag ?tag12 ;
Sep 18 07:57:05 localhost.localdomain gnome-session[1678]: nao:hasTag ?tag13 ;
....
Sep 18 07:57:05 localhost.localdomain gnome-session[1678]: ?tag4 a nao:Tag ;
nao:prefLabel " beauty" .
Sep 18 07:57:05 localhost.localdomain gnome-session[1678]: ?tag5 a nao:Tag ;
nao:prefLabel " beautiful" .
Sep 18 07:57:05 localhost.localdomain gnome-session[1678]: ?tag6 a nao:Tag ;
nao:prefLabel " bedroom" .
Sep 18 07:57:05 localhost.localdomain gnome-session[1678]: ?tag7 a nao:Tag ;
nao:prefLabel " caucasian" .
Sep 18 07:57:05 localhost.localdomain gnome-session[1678]: ?tag8 a nao:Tag ;
nao:prefLabel " casual" .
Sep 18 07:57:05 localhost.localdomain gnome-session[1678]: ?tag9 a nao:Tag ;
nao:prefLabel " cool" .
Comment 16 zachw 2014-12-10 09:33:00 UTC
Added log and source pdf to the redhat bug 1148570 a few days ago. Is there more info you need?
Comment 17 Martyn Russell 2014-12-15 09:44:13 UTC
(In reply to comment #16)
> Added log and source pdf to the redhat bug 1148570 a few days ago. Is there
> more info you need?

While the patch from Rishi is reasonable, I don't think it's the right approach, The right approach is to fix the bug not the logging the bug causes. Stemming the log to a particular length will ultimately make our job harder in the future because we won't have the full details printed. Again, these logs should *never* occur, if they do, it's a clear message - IT SHOULD BE FIXED.

Rishi, did you have more information about the actual bug so we can fix this and get it pushed to master / released?

Finally, I don't know why people are always so interested in removing logs which are actually important. If you're wondering why Tracker doesn't have metadata about your content, this log perfectly explains why. Without it, we just get bug reports with less information.
Comment 18 zachw 2014-12-17 09:59:09 UTC
(In reply to comment #17)
> (In reply to comment #16)
> > Added log and source pdf to the redhat bug 1148570 a few days ago. Is there
> > more info you need?
> 
> While the patch from Rishi is reasonable, I don't think it's the right
> approach, The right approach is to fix the bug not the logging the bug causes.

What do you think about changing the g_warning ("Sparql update was:\..." line to g_debug() (ie. the patch on 1148570) ?

> Stemming the log to a particular length will ultimately make our job harder in
> the future because we won't have the full details printed. Again, these logs
> should *never* occur, if they do, it's a clear message - IT SHOULD BE FIXED.
> 
> Rishi, did you have more information about the actual bug so we can fix this
> and get it pushed to master / released?
> 
> Finally, I don't know why people are always so interested in removing logs
> which are actually important. If you're wondering why Tracker doesn't have
> metadata about your content, this log perfectly explains why. Without it, we
> just get bug reports with less information.

In this case there are two issues:
1. The system log is a shared resource. If every program logged megabytes as tracker has done, the system log would become useless. More specifically, I would not be able to say "There is something wrong with my system, let me check if I can spot a problem in the log".
2. The nature of the data here is very personal. The 1148570 report refers to religious beliefs and such. It means the log cannot be provided to anyone without careful filtering. If I suffer another bug (not related to tracker) I can't just add my journal output to a bug report now because of the data it contains. I need to filter it.
Comment 19 Martyn Russell 2014-12-17 15:22:05 UTC
(In reply to comment #18)
> (In reply to comment #17)
> > (In reply to comment #16)
> > > Added log and source pdf to the redhat bug 1148570 a few days ago. Is there
> > > more info you need?
> > 
> > While the patch from Rishi is reasonable, I don't think it's the right
> > approach, The right approach is to fix the bug not the logging the bug causes.
> 
> What do you think about changing the g_warning ("Sparql update was:\..." line
> to g_debug() (ie. the patch on 1148570) ?

It has the same net effect as other solutions proposed here, it means we don't see it. Debugging isn't enabled by default and you would only find it after querying why metadata was limited or non-existent for some content.

We shouldn't be trying to hide these problems, we should be fixing them.

I hasten to add, these issues should NEVER HAPPEN. If I see this error message (based on broken SPARQL use), I stop everything and fix it, because it's a simple problem to solve and breaks so badly for many use cases.
 
> > Stemming the log to a particular length will ultimately make our job harder in
> > the future because we won't have the full details printed. Again, these logs
> > should *never* occur, if they do, it's a clear message - IT SHOULD BE FIXED.
> > 
> > Rishi, did you have more information about the actual bug so we can fix this
> > and get it pushed to master / released?
> > 
> > Finally, I don't know why people are always so interested in removing logs
> > which are actually important. If you're wondering why Tracker doesn't have
> > metadata about your content, this log perfectly explains why. Without it, we
> > just get bug reports with less information.
> 
> In this case there are two issues:
> 1. The system log is a shared resource. If every program logged megabytes as
> tracker has done, the system log would become useless. More specifically, I
> would not be able to say "There is something wrong with my system, let me check
> if I can spot a problem in the log".

Let me stop you there.
We don't log megabytes by default. If you get megabytes of warnings, either you're Tracker installation is not set up correctly or your content is severely broken (and this does happen - for example with MP3s that don't follow standards).

The "normal" case here, should be 0 warnings in the logs from Tracker. That's what I have here.

The syslog is rarely a shared resource in my experience and I would question it in your case.

> 2. The nature of the data here is very personal. The 1148570 report refers to
> religious beliefs and such. It means the log cannot be provided to anyone
> without careful filtering. If I suffer another bug (not related to tracker) I
> can't just add my journal output to a bug report now because of the data it
> contains. I need to filter it.

I would say, fix your syslog readability.

I personally don't see why religious believes should require any vetting, but each to their own. I am more interested in the SPARQL that caused the bug, I never read any of the content sent to me (sometimes privately). If you are able to reproduce with another piece of content (file) then that's just as good to me. In fact, often just knowing the extractor module used and the property that caused the problem are enough to investigate and fix.

About adding your journal - "the data it contains" is the reason for the bug, without it, we wouldn't be talking ;)
Comment 20 Elad Alfassa 2014-12-17 16:38:35 UTC
I don't understand. if you have a bug in something making a query, you really don't need the full text of Discworld (or my medical documents, or my paycheck) to debug it, right? So why not limit the output?

"This should not happen" is not a good enough answer. Software breaks frequently. Every program has bugs. But when it breaks it should at least try to fail gracefully, and not hose my system log with hundreds of megabytes of text
Comment 21 Jean-François Fortin Tam 2014-12-17 20:59:29 UTC
Sure "in theory" this shouldn't happen, the software should be perfect and all bugs caught in time before they reach users. In the real world, as you can see, things can break a million ways that do not necessarily occur on the developer's machine, and become security problems.

Why not adopt a "defensive" code approach by default, rather than let it run wild and defer the responsibility to the system logger? Sounds like a risky approach that not only taxes system resources, it becomes an increased attack vector. Kinda like apps that spit passwords or personally identifyable information into debug output...
Comment 22 Martyn Russell 2014-12-18 09:32:46 UTC
(In reply to comment #20)
> I don't understand. if you have a bug in something making a query, you really
> don't need the full text of Discworld (or my medical documents, or my paycheck)
> to debug it, right? So why not limit the output?

Why do you assume we want personal data? That's the rhetoric I feel here.

So we use a limit - what limit do we use? and what do I do when that limit cuts off the SPARQL that causes the actual bug? Set a new limit?

Also, the patch suggested by Rishi still allows for 256 bytes of content to be dumped to the syslog. The way I see it, that's possibility for:
a) not enough information to fix the bug
b) enough information to dump sensitive content into the syslog

Your arguments don't help the Tracker team and don't solve your concerns about privacy either.

> "This should not happen" is not a good enough answer. Software breaks

Actually it is. I maybe fix a few of these a year at most. I don't remember the last time I had to fix an extractor like this situation.

> frequently. Every program has bugs. But when it breaks it should at least try
> to fail gracefully, and not hose my system log with hundreds of megabytes of
> text

Sure there are bugs in all programs.

Frankly, I don't give a shit if your system log is hosed in a unexpected or serious scenario. I care about Tracker operating correctly in the first place. The fact that you are focused on an a side-issue is quite illogical to me.

Finally, we don't fail ungracefully at all. I didn't see an abort() or SEGFAULT anywhere.

If this was happening as a point of regular operation, I would agree with you entirely.

(In reply to comment #21)
> Sure "in theory" this shouldn't happen, the software should be perfect and all
> bugs caught in time before they reach users. In the real world, as you can see,
> things can break a million ways that do not necessarily occur on the
> developer's machine, and become security problems.

Clearly you don't really understand the kind of bug this is.

Let me give you some perspective. This is like using an API that doesn't exist any more. This is quite major. It's not a corner case. It means the ENTIRE SPARQL is thrown out and none of that metadata is available in Tracker. This happens for every file that triggers it.

I would simply prefer to fix it than waste energy replying about how we notice the problem in the first place.
 
> Why not adopt a "defensive" code approach by default, rather than let it run
> wild and defer the responsibility to the system logger? Sounds like a risky
> approach that not only taxes system resources, it becomes an increased attack
> vector. Kinda like apps that spit passwords or personally identifyable
> information into debug output...

Only taxes the syslog (if it is taxing it at all?), if an extractor written by Tracker maintainers is using broken SPARQL (rarely the case) and if the user has broken file(s) (occasionally the case) and if they have many of them (not usually the case).

That's a lot of "if" conditions.
Comment 23 Elad Alfassa 2014-12-18 10:02:18 UTC
I don't think you want my personal data. However, I also don't think my personal data should be dumped into my system log. My point is you don't *need* all this data being dumped there to debug the issue, so why dump all of it there? Why not accept the patch you got here at least until a better solution is available?
Comment 24 Martyn Russell 2014-12-18 10:15:17 UTC
(In reply to comment #23)
> I don't think you want my personal data. However, I also don't think my
> personal data should be dumped into my system log. My point is you don't *need*
> all this data being dumped there to debug the issue, so why dump all of it

So 2 things:

1. The syslog is private to privileged users (normally).
2. I don't know what I need, so trimming what I may need doesn't help, it hinders.

I did talk to Rishi on IRC about this, one solution would be to remove the nie:plainTextContent, but allow everything else. This is essentially the "raw text" of a file, but all other metadata would remain.

This solution at least meets your proposal half way.

If there is still a bug with content not logged, we at least would know the file that causes it and can privately test with it.

> there? Why not accept the patch you got here at least until a better solution
> is available?

Because what tends to happen is, better solutions don't come around and maintainers end up having to work harder to find out what causes problems.
Comment 25 zachw 2014-12-18 11:20:12 UTC
Created attachment 292968 [details]
first ~4MB of /usr/libexec/tracker-extract -v 3 > tracker-extract-log.txt 2>&1
Comment 26 zachw 2014-12-18 11:21:15 UTC
Created attachment 292970 [details]
first file causes failed insert
Comment 27 zachw 2014-12-18 11:43:06 UTC
(In reply to comment #19)
> (In reply to comment #18)
> > (In reply to comment #17)
> > What do you think about changing the g_warning ("Sparql update was:\..." line
> > to g_debug() (ie. the patch on 1148570) ?
> 
> It has the same net effect as other solutions proposed here, it means we don't
> see it. Debugging isn't enabled by default and you would only find it after
> querying why metadata was limited or non-existent for some content.

I have the g_debug() change locally. There are wiki pages for tracker describing how to submit bugs and they recommend to enable debugging output:
https://wiki.gnome.org/Projects/Tracker/Documentation/ReportingBugs
https://wiki.gnome.org/Projects/Tracker/Documentation/Debugging

I've attached the output got by following the instructions. It contains largely the same info as the journal output I attached to 1148570. I also attached the first pdf to cause the failed insert.

Note that the file path is not present in the journal output but is present in the the tracker-extra -v3 output. Without knowing the file path I was only guessing the correct source pdf doc.

Using debug level avoids the problem of deciding how much to limit the debugging output. It does introduce the extra step of turning debugging on, but in tracker's case it appears well documented and easy to do.

> > In this case there are two issues:
> > 1. The system log is a shared resource. If every program logged megabytes as
> > tracker has done, the system log would become useless. More specifically, I
> > would not be able to say "There is something wrong with my system, let me check
> > if I can spot a problem in the log".
> 
> Let me stop you there.
> We don't log megabytes by default. If you get megabytes of warnings, either
> you're Tracker installation is not set up correctly or your content is severely
> broken (and this does happen - for example with MP3s that don't follow
> standards).

I hope you can see from the attachment that you do log megabytes by default. The full output was 8MB, compressed down to ~4MB. This bugzilla limits the file to 1.6MB so I cut off the end.

> I am more interested in the SPARQL that caused the bug, I
> If you are able
> to reproduce with another piece of content (file) then that's just as good to
> me. In fact, often just knowing the extractor module used and the property that
> caused the problem are enough to investigate and fix.
> 

I've added as much info as I can atm (debug messages, source pdf file). This data was already available on 1148570. As I asked before in comment #16, is there more info that you need?

I feel it's likely that the version change caused by upgrading fedora 20 to 21 was the cause of this and I should just clear out any existing tracker data. Do you agree? does tracker version control its DB?
Comment 28 zachw 2014-12-18 12:19:10 UTC
(In reply to comment #27)
> (In reply to comment #19)
> > (In reply to comment #18)
> > Let me stop you there.
> > We don't log megabytes by default. If you get megabytes of warnings, either
> > you're Tracker installation is not set up correctly or your content is severely
> > broken (and this does happen - for example with MP3s that don't follow
> > standards).
> 
> I hope you can see from the attachment that you do log megabytes by default.
> The full output was 8MB, compressed down to ~4MB. This bugzilla limits the file
> to 1.6MB so I cut off the end.

Ugh sorry I misunderstood this.
You meant: By default == when tracker is operating normally.
I thought: By default == without my intervention, as in I did not ask tracker to log verbosely.
Comment 29 Martyn Russell 2014-12-18 14:30:20 UTC
(In reply to comment #26)
> Created an attachment (id=292970) [details]
> first file causes failed insert

Tested with master, I get:

 nco:creator [ a nco:Contact ;
	 nco:fullname "csewards"] ;

Which is quite different to what you get, which is:

 nco:creator [ a nco:Contact ;
	 nco:fullname "Error! No text of specified style in document."] ;

So the bug is either with the PDF extractor version you have there, OR libpoppler-glib that you're using.

It could be that you have an earlier? or older? version to what I am using here. This what I was testing with:

$ ldd libextract-pdf.so|grep -i poppler
	libpoppler-glib.so.8 => /usr/lib/x86_64-linux-gnu/libpoppler-glib.so.8 (0x00007f9c28f66000)
	libpoppler.so.46 => /usr/lib/x86_64-linux-gnu/libpoppler.so.46 (0x00007f9c23fdf000)

$ dpkg -S /usr/lib/x86_64-linux-gnu/libpoppler.so.46
libpoppler46:amd64: /usr/lib/x86_64-linux-gnu/libpoppler.so.46

$ dpkg -S /usr/lib/x86_64-linux-gnu/libpoppler-glib.so.8
libpoppler-glib8:amd64: /usr/lib/x86_64-linux-gnu/libpoppler-glib.so.8

$ dpkg -l libpoppler46:amd64 libpoppler-glib8:amd64 |grep ii
ii  libpoppler-glib8:amd64            0.26.5-2              amd64                 PDF rendering library (GLib-based shared library)
ii  libpoppler46:amd64                0.26.5-2              amd64                 PDF rendering library
Comment 30 Sam Thursfield 2014-12-18 16:14:46 UTC
About the problem of Tracker logging user data into a system log: I don't think it'd ever be possible to write "log anonymiser" code that could filter anything potentially 'private' from a Tracker error message while not removing anything that's needed to debug the problem. So on systems where syslog/journal is considered a shared resource, Tracker shouldn't be logging to the syslog but to somewhere readable only by the current user. This is configurable, I think, so perhaps we need to change the default config, or change how distros configure Tracker logging. Not sure if systemd-journald can do user-private journals -- I think it can.

Putting multiple MBs of data into the log for a single error is obviously wrong, we need to stop doing that.
Comment 31 zachw 2014-12-19 11:01:01 UTC
(In reply to comment #29) 
> It could be that you have an earlier? or older? version to what I am using
> here. This what I was testing with:
> 

[zach@alice task14]$ ldd /usr/lib64/tracker-1.0/extract-modules/libextract-pdf.so | grep poppler
	libpoppler-glib.so.8 => /lib64/libpoppler-glib.so.8 (0x00007f6b90bf3000)
	libpoppler.so.46 => /lib64/libpoppler.so.46 (0x00007f6b8bd4c000)

[zach@alice task14]$ rpm  -qa | grep poppler
poppler-devel-0.26.2-3.fc21.x86_64
poppler-data-0.4.7-2.fc21.noarch
poppler-glib-devel-0.26.2-3.fc21.x86_64
poppler-utils-0.26.2-3.fc21.x86_64
poppler-0.26.2-3.fc21.x86_64
poppler-glib-0.26.2-3.fc21.x86_64
Comment 32 Martyn Russell 2014-12-22 10:33:57 UTC
Rishi, do you have time to write a patch to omit the plainTextContent in the SPARQL from tracker-extract (for logging only)? This would certainly improve the situation for everyone. Thanks.
Comment 33 zachw 2015-01-11 01:53:03 UTC
Created attachment 294257 [details]
tracker extract output after deleting tracker's stored data

Following on from comment #27 I deleted ~/.local/share/tracker and ~/.cache/tracker. The "Unable to insert multiple values" errors are now gone.

There is another error "at most 64 tables in a join" which results in a Sparql update dump. The output of /usr/lib/libexec/tracker-extract is attached. The offending pdf file is too large to attach, it is available here https://drive.google.com/folderview?id=0B70Uicai8GzIUklKNng0a0VfWUU&usp=sharing
Comment 34 Debarshi Ray 2015-01-13 14:02:41 UTC
Created attachment 294433 [details] [review]
libtracker-miner: Restrict the amount of data that is logged for errors

Happy Birthday, Martyn!
Comment 35 Martyn Russell 2015-01-13 14:28:28 UTC
Comment on attachment 294433 [details] [review]
libtracker-miner: Restrict the amount of data that is logged for errors

Looks good to me, thanks Rishi! :D And thank you for the birthday wishes ;)
Comment 36 Debarshi Ray 2015-01-13 14:42:30 UTC
Comment on attachment 294433 [details] [review]
libtracker-miner: Restrict the amount of data that is logged for errors

Pushed to both master and tracker-1.2.
Comment 37 Martyn Russell 2015-01-27 18:57:38 UTC
*** Bug 743028 has been marked as a duplicate of this bug. ***
Comment 38 Martyn Russell 2015-01-27 18:59:34 UTC
*** Bug 741137 has been marked as a duplicate of this bug. ***
Comment 39 Jean-François Fortin Tam 2015-06-25 04:54:03 UTC
I'm running Tracker 1.4.0 on Fedora 22 and I still see it spamming the daylight out of my logs. Looking at the output of journalctl I see a ton of things like 

(tracker-extract:1855): Tracker-WARNING **: Task 9, error: Unable to insert multiple values for subject `urn:uuid:a677fd76-57f5-fa4c-2103-db9d37eb321a' and single valued property `dc:identifier' (old_value: 'URN:ISBN:978-1-9999-430-6', new value: '9780752877402')

(and variants along those lines, with full text and metadata being dumped into the logs each time)

Am I missing something here? AFAICT the versions in Fedora 22 are recent enough to incorporate the fixes above?
Comment 40 Jean-François Fortin Tam 2015-06-25 04:54:26 UTC
*** Bug 747198 has been marked as a duplicate of this bug. ***
Comment 41 Debarshi Ray 2015-06-25 07:10:53 UTC
I continue to be surprised that tracker issues are filed against gnome-session (comment 13).
Comment 42 Jean-François Fortin Tam 2015-06-25 13:56:38 UTC
Ah sorry I didn't realize this bug was filed on gnome-session, I just followed the trail of emails from one bug to another :) does it make sense to re-close this, de-duplicate bug #747199 and consider it the follow-up to this bug though?
Comment 43 Elad Alfassa 2015-06-25 14:04:02 UTC
It was originally filled against tracker, then someone moved it to gnome-session because they thought it's not a tracker issue. This is still a tracker issue. I think we should just move this one back to tracker.
Comment 44 Carlos Garnacho 2015-07-04 00:08:29 UTC
#1. This clearly belongs to tracker, or most of it.

#2. Guys, you're totally barking the wrong tree. Those are very valid tracker errors, each of those represent inconsistencies in your tracker database, extra io/cpu time at each startup, or everything at the same time.

The question here is "why are they happening?", not "OMG THERE'S SO MANY OF THOSE", and the correct procedure is:

1) Identifying the patterns in your log, do they happen for all files of a certain type, only for a certain file? Maybe I get different tracker errors but I now see it all that as SPAM?

2) Getting into specifics, The specific "Unable to insert multiple values..." error can happen on a plethora of properties. Eg. bug #746401 fixed one on nie:identifier for epubs, might fix yours? maybe, probably not. I find it quite amusing that the bug title and description precisely skip the info I'd appreciate there.

So, because this bug is a mess (just "errors" put together without distinction, dups are wrong/bogus...), each of you please recheck your logs, and file individual bugs, with full log excerpts and if possible, example files :). I'll try to make a sense of the attached logs anyway.

With all that, tracker might be fixed so the files issuing those errors are processed silently, once, and never again.

PS. In order to know what file is behind one of such warnings, you can do "tracker info urn:uuid:..." in the command line.
Comment 45 el 2015-07-04 21:25:00 UTC
I think all the complaints about the log are valid. I just checked syslog for tracker, and I see tracker having made a bazillon entries for some cache files *in the same second* because something is wrong with them ("no disk nor store mtime").

Now I don't understand why tracker can't stop after encountering this after 50 files in one millisecond and stop repeating it all over for an additional 1000 files slowing down the whole system with infinite logging. There's limited disk I/O after all and lengthy logging strings aren't for free.

Nobody complains about some errors in the system log, and of course they are needed to debug the software, but tracker just has a tendency to find something wrong with a file or itself about a file in some sort and then *dump an error for 99999 files* possibly including even lengthy file contents.

There MUST be some way to be a bit more moderate with the logging, and still allow you to get enough from them to have a clue what you should fix next of apparently many issues.
Comment 46 Elad Alfassa 2015-07-04 22:11:50 UTC
Right, the errors themselves might be a problem, but it's not why I opened this bug, and it's not what all of us are complaining about.

It bothers me that tracker dumps document contents into my syslog. It's a real issue both for the performance of my system and my privacy. It also means that if I delete my documents I need to make sure to delete the system log too, otherwise they can be recovered from there. 

I have not encountered this personally since it was "fixed", but Jeff is seeing a similar issue so it means it was not fixed in all relevant code paths.

There's no problem with being verbose in the logs if you hide that verbosity behind a --verbose or --debug or --whatever flag, but it's really not a wise default choice. Most people are not going to report any tracker error that appears in their logs, so the verbose debugging output only helps tracker contributors, and no one else.

On a related and slightly worrying note, why is the document contents even in the query (that gets dumped into the logs)? Does tracker store document contents? if so, this is very bad for user privacy, because it means any document they have is stored both in the tracker index and on the filesystem, which means if they delete a document, it's not really deleted (also it's very dangerous if tracker happens to index an encrypted volume and saves the index in a non-encrypted one).
Comment 47 Carlos Garnacho 2015-07-06 01:28:34 UTC
(In reply to Jonas Thiem from comment #45)
> I think all the complaints about the log are valid. I just checked syslog
> for tracker, and I see tracker having made a bazillon entries for some cache
> files *in the same second* because something is wrong with them ("no disk
> nor store mtime").

Again, a warning you should never see. This is something that deserves a separate bug, with an excerpt of logs, and a description of the situations it happens. Yet there's just a brief comment about this in support to a complaint :(

> 
> Now I don't understand why tracker can't stop after encountering this after
> 50 files in one millisecond and stop repeating it all over for an additional
> 1000 files slowing down the whole system with infinite logging. There's
> limited disk I/O after all and lengthy logging strings aren't for free.

You're tripping into something that shouldn't be happening. After this, behavior is considered undefined, the warnings made their purpose to direct your attention to bugzilla, now let's make something useful out of it ;).

FWIW, log size is not a concern that belongs to Tracker. Systemd eg. provides plenty of options to limit journal/bt sizes. 

> 
> Nobody complains about some errors in the system log, and of course they are
> needed to debug the software, but tracker just has a tendency to find
> something wrong with a file or itself about a file in some sort and then
> *dump an error for 99999 files* 

This is definitely an exaggeration.

> possibly including even lengthy file contents.

Debarshi fixed that in January.

> 
> There MUST be some way to be a bit more moderate with the logging, and still
> allow you to get enough from them to have a clue what you should fix next of
> apparently many issues.

Tracker is a complex state machine, with many moving parts and iterative processes, it is impossible to tell at the time of warning whether the message is redundant or not, or whether it is related to prior or future warnings.

On top of this, the input Tracker deals with are user files, this equals to basically random input that we can't ever fully foresee nor replicate. Some errors seen in this bug like the one on the subject are relative to content extraction, that is, relative to files we've never seen nor touched. Some other errors seen in this bug are related to other pieces of the processing, such as file monitoring, or even across several processes. 

In this shape, a "make tracker spam less" bug is moot, borderline RESOLVED INVALID.



(In reply to Elad Alfassa from comment #46)
> Right, the errors themselves might be a problem, but it's not why I opened
> this bug, and it's not what all of us are complaining about.

Only with some help from your side we will be able to fix tracker so your logs are squeaky clean, mine certainly are, and I want the same for you. This is a necessary process to get Tracker rock solid on all input, which we need help for, just because there's so many weird files out there.

OTOH, Silencing things away will mean our software has more unseen bugs that will take longer to fix, we won't conciously go that way to justify sloth.

Tracker doesn't use warning/critical error levels for fun, those are real log-worthy failures, and several of those errors (like the one on the bug title) will make tracker try again at the next restart, which will fail again and again. Silencing Tracker wont fix the fact that what should be a one-time operation is failing on every occasion, with the extra cpu and I/O involved.

> 
> It bothers me that tracker dumps document contents into my syslog. It's a
> real issue both for the performance of my system and my privacy. It also
> means that if I delete my documents I need to make sure to delete the system
> log too, otherwise they can be recovered from there. 
> 
> I have not encountered this personally since it was "fixed", but Jeff is
> seeing a similar issue so it means it was not fixed in all relevant code
> paths.

Jeff's bug report will be surely appreciated.

> 
> There's no problem with being verbose in the logs if you hide that verbosity
> behind a --verbose or --debug or --whatever flag, 

There's already such toggles (try -v [1-3]). We have fairly strict guidelines as to what deserves g_debug/message/warning/critical treatment. Under our point of view, these are disrupting errors that deserve treatment, and the information provided is valuable to us.

> but it's really not a wise
> default choice. Most people are not going to report any tracker error that
> appears in their logs, so the verbose debugging output only helps tracker
> contributors, and no one else.

No, it helps us identify the problem if we get useful bug reports, it doesn't matter who do we get it from.

> 
> On a related and slightly worrying note, why is the document contents even
> in the query (that gets dumped into the logs)? Does tracker store document
> contents? 

Yes, a plain text, munged (unaccented, stemmed, ...) and maybe incomplete representation is stored, so you can search for contents too. Content indexing can be indirectly disabled through the /org/freedesktop/tracker/extract/max-bytes dconf option.

> if so, this is very bad for user privacy, because it means any
> document they have is stored both in the tracker index and on the
> filesystem, which means if they delete a document, it's not really deleted

Well, things are deleted from the database too, Tracker doesn't eat your data forever.

> (also it's very dangerous if tracker happens to index an encrypted volume
> and saves the index in a non-encrypted one).

A valid concern, again to be taken separately to another bug.
Comment 48 Martyn Russell 2015-07-12 11:37:36 UTC
I just want to add to Carlos' points as co-maintainer and having had to respond to many of these reports over the past year or so too.

Carlos is bang on and completely correct in all cases above.

The main concern here for many reporters is that it's an inconvenience or concern. And yes, you should be concerned and it is an inconvenience. These errors mean your data is not be found with Tracker or some files are perhaps broken when trying to index (resulting in no entry for that content of yours). Not to mention that if your database is hosed, you should know about it and/or we should fix it. Sitting there silently would just get us more bug reports and with even less information to go on.

I'm not really saying anything Carlos or myself (in other comments/threads) haven't already said. But since we're the ones fixing the bugs reported, we know what we need to fix them, even if it's an inconvenience to users.

(In reply to Elad Alfassa from comment #46)
> On a related and slightly worrying note, why is the document contents even
> in the query (that gets dumped into the logs)? Does tracker store document
> contents? if so, this is very bad for user privacy, because it means any
> document they have is stored both in the tracker index and on the
> filesystem, which means if they delete a document, it's not really deleted
> (also it's very dangerous if tracker happens to index an encrypted volume
> and saves the index in a non-encrypted one).

I wanted to add to this.

1. As Carlos pointed out, we don't print the entirety of the (file) content related to the issue in syslog since January.

2. The database is stored on the user's home partition (usually ~/.cache/tracker), on the same partition as the standard indexed locations. So I don't grant this assumption that content is less secure. If you have a weird set up for your file system, you can even configure the location of the database if you so wish. Databases are *per user*. If you still think this can be more secure, please file a but about it.

3. As Carlos also said, content isn't kept forever, we are a search engine and indexing tool, keeping an index to content that doesn't exist would be quite broken.

4. The only way the encrypted volume argument even becomes valid, is if you have :
a) Your indexing locations outside of the same partition the database is on (which is $HOME by default)
b) Your indexing locations or database are on different partitions with different encryption semantics (e.g. one partition is encrypted and another isn't). 

As standard, this is not the case and I've yet to hear of such a case. If you have reason or evidence to support otherwise, please file a bug and we will take a look at it.

****

But please, let's look at all issues objectively and separately instead of throwing them into one big bug full of different problems which we can't keep track of properly (if you'll excuse the pun) :)


(In reply to Jean-François Fortin Tam from comment #39)
> I'm running Tracker 1.4.0 on Fedora 22 and I still see it spamming the
> daylight out of my logs. Looking at the output of journalctl I see a ton of
> things like 
> 
> (tracker-extract:1855): Tracker-WARNING **: Task 9, error: Unable to insert
> multiple values for subject `urn:uuid:a677fd76-57f5-fa4c-2103-db9d37eb321a'
> and single valued property `dc:identifier' (old_value:
> 'URN:ISBN:978-1-9999-430-6', new value: '9780752877402')
> 
> (and variants along those lines, with full text and metadata being dumped
> into the logs each time)

So this looks like some extractor using the dc:identifier property is trying to insert more than one value for it. There are only a handful of extractors that can cause this:

  $ git grep -i dc:identifier .
  ...

This bug represents a fundamentally broken extractor, meaning all files using this extractor are likely to fail on extraction. A VERY SERIOUS BUG.

We can't know more without asking you for the details of the files that are being indexed. As Carlos said, you can find out the file information using:

  $ tracker info urn:uuid:...

Finally, when we find out what the file *type* is, we can fix the extractor.

This is just one case in your log file. If there are others, they should be checked and fixed too.

> Am I missing something here? AFAICT the versions in Fedora 22 are recent
> enough to incorporate the fixes above?

Well, as Carlos said, an issue related to this was fixed in January and your log output shows that.

  commit 59745984f91fa0c5e464c312c3a142fe3bd3b853
  Author: Debarshi Ray <debarshir@gnome.org>
  Date:   Tue Jan 13 14:51:50 2015 +0100

But this reduces the logging to not dump the entire content of the failure. It doesn't fix the bug you're seeing.

This is why it's important to have separation between a general purpose "spamming syslog" bug report and the actual issues people are seeing.

To conclude, we can't know if this is fixed or not without more information about your specific case.

So please create a new bug report with the complete log and the output of your "tracker info" command on the UUID reported in the log and we will address that.

As for this particular bug, it's already fixed and your comment even supports that.
Comment 49 Martyn Russell 2015-07-12 19:30:43 UTC
*** Bug 733809 has been marked as a duplicate of this bug. ***