GNOME Bugzilla – Bug 788298
msoffice-xml extractor sets some properties to be empty strings
Last modified: 2017-12-18 12:24:40 UTC
Properties like nie:title are being set to "" in some cases. These should just not be set if we don't have a sensible value for them.
Created attachment 360620
extract/msoffice-xml: Treat zero-length strings as unset properties

The MS Office extractor has been producing stuff like this:

    <file:///home/sam/Downloads/spreadsheet.xls>
        nie:comment "" ;
        nie:contentLastModified "2016-06-13T14:19:50Z" ;
        nie:contentCreated "2016-05-14T10:17:05Z" ;
        nie:plainTextContent "..." ;
        nie:subject "" ;
        a nfo:PaginatedTextDocument ;
        nie:title "" .

This breaks queries which use COALESCE to do things like this:

    SELECT COALESCE(?nie_title, ?filename) as ?title

If ?nie_title is unset then ?title will be set to the contents of ?filename; but if ?nie_title is present and set to an empty string then ?title will be set to that empty string, which is not at all useful.

The extractor will now ignore zero-length strings.
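To illustrate why an empty nie:title defeats the fallback, here is a minimal sketch in C. It is not Tracker code: coalesce2() is a hypothetical stand-in for a two-argument COALESCE, with NULL modelling an unbound variable.

```c
#include <stddef.h>

/* Hypothetical stand-in for SPARQL COALESCE with two arguments.
 * NULL models an unbound variable; the first bound argument wins,
 * even if it is an empty string. */
static const char *
coalesce2 (const char *first, const char *second)
{
	return first != NULL ? first : second;
}
```

With this model, coalesce2 (NULL, filename) yields the filename, while coalesce2 ("", filename) yields the empty string, which matches the behaviour described above.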
Comment on attachment 360620
extract/msoffice-xml: Treat zero-length strings as unset properties

I tend to prefer text[0] != '\0' checks because they're O(1), but won't bikeshed about it :).
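For reference, the two styles of check can be sketched side by side. This assumes the original patch used a strlen()-style test, which the thread does not show; both helper names are hypothetical and neither is taken from the committed change.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: length-based check. strlen() is specified to
 * scan up to the terminating NUL, so its cost grows with the string. */
static int
has_value_strlen (const char *text)
{
	return text != NULL && strlen (text) > 0;
}

/* Hypothetical helper: first-byte check. Inspects a single byte,
 * so the cost is constant regardless of the string length. */
static int
has_value_first_byte (const char *text)
{
	return text != NULL && text[0] != '\0';
}
```

Both helpers give the same answer for any NULL or NUL-terminated input; they differ only in how much of the string they may need to read.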
Review of attachment 360620:

Thanks! Committed with the suggested change as a1e766cd12610b10617a334489e8d117be337019