Bug 788298 - msoffice-xml extractor sets some properties to be empty strings
Status: RESOLVED FIXED
Product: tracker
Classification: Core
Component: Miners
Version: git master
Hardware: Other Linux
Importance: Normal normal
Target Milestone: ---
Assigned To: tracker-general
Depends on:
Blocks:
Reported: 2017-09-28 17:51 UTC by Sam Thursfield
Modified: 2017-12-18 12:24 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
extract/msoffice-xml: Treat zero-length strings as unset properties (5.43 KB, patch)
2017-09-28 17:52 UTC, Sam Thursfield
committed

Description Sam Thursfield 2017-09-28 17:51:34 UTC
Properties like nie:title are being set to "" in some cases. These should just not be set if we don't have a sensible value for them.
Comment 1 Sam Thursfield 2017-09-28 17:52:04 UTC
Created attachment 360620
extract/msoffice-xml: Treat zero-length strings as unset properties

The MS Office extractor has been producing stuff like this:

    <file:///home/sam/Downloads/spreadsheet.xls> nie:comment "" ;
      nie:contentLastModified "2016-06-13T14:19:50Z" ;
      nie:contentCreated "2016-05-14T10:17:05Z" ;
      nie:plainTextContent "..." ;
      nie:subject "" ;
      a nfo:PaginatedTextDocument ;
      nie:title "" .

This breaks queries which use COALESCE to do things like this:

    SELECT COALESCE(?nie_title, ?filename) as ?title

If ?nie_title is unset then ?title will be set to the contents of
?filename; but if ?nie_title is present and set to an empty string then
?title will be set to that empty string, which is not at all useful.
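
To illustrate the COALESCE behaviour described above, here is a minimal
C sketch. coalesce2 is a hypothetical helper, not part of Tracker; it
assumes an unbound SPARQL variable can be modelled as NULL and returns
the first bound argument.

```c
#include <stddef.h>

/* Hypothetical stand-in for a two-argument SPARQL COALESCE.
 * NULL models an unbound variable; an empty string counts as bound. */
const char *
coalesce2 (const char *first, const char *second)
{
  return first != NULL ? first : second;
}
```

Under this model, coalesce2 (NULL, "spreadsheet.xls") returns the
filename, while coalesce2 ("", "spreadsheet.xls") returns the empty
string, because "" is a bound value.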

The extractor will now ignore zero-length strings.
Comment 2 Carlos Garnacho 2017-10-04 10:44:57 UTC
Comment on attachment 360620
extract/msoffice-xml: Treat zero-length strings as unset properties

I tend to prefer text[0] != '\0' checks because they're O(1), but won't bikeshed about it :).
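
A minimal sketch of the first-byte check mentioned above, assuming
NUL-terminated C strings. nonempty_or_null is a hypothetical helper and
is not taken from the committed patch. A length-based test such as
strlen (text) > 0 has to walk the whole string, whereas this test reads
only the first byte.

```c
#include <stddef.h>

/* Hypothetical helper: treat a NULL or zero-length string as unset.
 * Returns NULL in that case, otherwise returns the string unchanged. */
const char *
nonempty_or_null (const char *text)
{
  if (text == NULL || text[0] == '\0')
    return NULL;

  return text;
}
```

A caller could then skip setting a property such as nie:title whenever
the helper returns NULL.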
Comment 3 Sam Thursfield 2017-10-04 17:11:51 UTC
Review of attachment 360620:

Thanks! Committed with the suggested change as a1e766cd12610b10617a334489e8d117be337019