GNOME Bugzilla – Bug 713830
Attachments without Content-Disposition are hidden
Last modified: 2014-08-13 19:28:55 UTC
---- Reported by geary-maint@gnome.bugs 2013-11-19 02:30:00 -0800 ---- Original Redmine bug id: 7704 Original URL: http://redmine.yorba.org/issues/7704 Searchable id: yorba-bug-7704 Original author: James Fellows Original description: PDF files are commonly (incorrectly) attached as Content-Type: application /octet-stream. Geary displays the email as if it has no attachments at all - even though the attachment is definitely there. I guess they are being hidden because Geary doesn't know how to open them? Even so, they should not be hidden - hiding them results in potentially embarrassing "you forgot to attach... no I didn't" conversations. When viewed in gmail the attachment is visible. (apologies if this is a duplicate - I searched for attachments and couldn't see anything related). ---- Additional Comments From geary-maint@gnome.bugs 2013-11-20 16:37:00 -0800 ---- ### History #### #1 Updated by Jim Nelson about 19 hours ago * **Priority** changed from _Normal_ to _High_ * **Target version** set to _0.5.0_ I don't think this is a duplicate. Good catch! --- Bug imported by chaz@yorba.org 2013-11-21 20:23 UTC --- This bug was previously known as _bug_ 7704 at http://redmine.yorba.org/show_bug.cgi?id=7704 Unknown milestone "unknown in product geary. Setting to default milestone for this product, "---". Setting qa contact to the default for this product. This bug either had no qa contact or an invalid one. Resolution set on an open status. Dropping resolution
*** Bug 733542 has been marked as a duplicate of this bug. ***
We've had a few reports of this recently. We should give this some attention for 0.8.
*** Bug 733590 has been marked as a duplicate of this bug. ***
Are we sure that the problem is the Content-Type and not the Content-Disposition? As far as I know, the only Content-Type check we make is to find body components and to figure out what to do with inline parts. I don't think we discard parts with bad types. However, the headers in bug #733542 don't specify a Content-Disposition, and we do throw away parts that don't specify their disposition when assembling attachments. (See line 758 of rfc822-message.vala.) Bug #713546 may be relevant. Jim quotes from RFC 2183: Content-Disposition is an optional header field. In its absence, the MUA may use whatever presentation method it deems suitable. I guess not displaying them is technically in compliance, but it probably doesn't really meet the spirit of the RFC.
I'm pretty sure it's a Content-Disposition issue. This works: Content-Type: application/octet-stream Content-Disposition: attachment; filename=aircanada1.pdf Content-Transfer-Encoding: base64 This doesn't: Content-Type: application/pdf; name=aircanada1.pdf Content-Transfer-Encoding: base64
I think you went right to the heart of the problem, Robert. I suspect whatever solution we came up with for the other Content-Disposition problem simply didn't account for PDFs, and probably other media types. I've changed the title of this ticket to better represent the issue.
I've pushed a candidate fix to wip/713830-disposition. This only fixes the problem for new messages; in order to fix the problem for existing email in local storage, we need to do a database upgrade that, in essence, builds all non-existing attachments. A more blunt-force solution is to wipe all attachments and rebuild all. (It's technically not a database upgrade in the sense that there's no schema changes, but we often use the database upgrade system to do attachment fixes.) There's a second problem this patch creates. If a message has an attached image that has no Content-Disposition but is displayed inline by an HTML message (via a cid: URL), the image will be displayed in the message as well as an attachment. This is not a showstopper by any means, but I find it annoying. We have some possible ways forward here: (a) Don't worry about old email, just land the patch as-is and let users deal with it. (b) Perform the database upgrade and rebuild attachments. (c) Do (b) and try to fix the image attachment problem as well. I suspect we could fix it by looking for a Content-ID and assuming its presence means the image is used inline and should be treated as such. However: we don't store the Content-ID in the database, and my investigation suggests we would need to do that in order for the client to determine if the attachment has one. Obviously (c) is a bit more work for questionable payoff, but since (b)'s on the table, I thought I'd bring it up. Right now I'm leaning toward (b). Would be curious to hear others' thoughts.
I'm not happy about (a) so I've started working on (b). Still thinking about (c).
*** Bug 714641 has been marked as a duplicate of this bug. ***
I've pushed up a database upgrade to the wip branch. WARNING: This branch will update your database and make it unusable with Geary in master. Make a backup of your Geary data directory before using. Robert, if it's possible, I'd appreciate it if you could test out the patch. I just need some sanity checking here before I commit. I'm now inclined to go with (b) and ticket (c) for later.
I've tested it briefly, and it seems to work. The code looks reasonable, although I've only glanced over the database upgrade code. I think there's redundancy in the test at line 775 of rfc822-message.vala -- anything that passed the test at line 759 will make it past this as well. We can also remove the comment at line 769; the comment before this block explains it better. I'm not convinced that we should do the database upgrade right now. Presumably, we'll also want to re-calculate the attachments once we fix (c). So if we can do that reasonably quickly, maybe we should hold off on the database upgrade until then.
An alternative approach to (c) would be filter these attachments out in ConversationViewer.should_show_attachment(). This is where we're filtering out inline images. At this point, we've already gone through the HTML in insert_html_markup() and found the mime parts for each cid: image. We could assemble a list of these parts and then blacklist them in should_show_attachment(). But while we're thinking about this, I'll point out another problem that I've noticed: Images that are linked in a HTML part with Content-Disposition: inline show up twice -- once in the HTML and again with our inlining code. We can't fix this in the same way -- we're adding them in earlier, in RFC822.Message.get_body(). But if we kept track of the Content-ID, we could exclude them in a similar fashion.
(In reply to comment #11) > I think there's > redundancy in the test at line 775 of rfc822-message.vala -- anything that > passed the test at line 759 will make it past this as well. We can also remove > the comment at line 769; the comment before this block explains it better. Good eye, I've made these changes in the branch. > I'm not convinced that we should do the database upgrade right now. > Presumably, we'll also want to re-calculate the attachments once we fix (c). > So if we can do that reasonably quickly, maybe we should hold off on the > database upgrade until then. My concern with this is that the Content-ID approach isn't trivial. I'm time-constrained with the amount of work I can do on Geary right now. If the problem had greater urgency I could prioritize it, but as it stands it's more an aesthetic problem than a usability (or data loss) issue. Also, the current upgrade code has the nice advantage of only rebuilding attachments if it detects some or all are missing from a message. In my working database, that was 1400 out of 4100. With a Content-ID upgrade, we need to rebuild all attachments -- there's no easy way to associate existing attachments with the ones in the RFC822 message and simply add the Content-ID to the database. > An alternative approach to (c) would be filter these attachments out in > ConversationViewer.should_show_attachment(). This is where we're filtering out > inline images. At this point, we've already gone through the HTML in > insert_html_markup() and found the mime parts for each cid: image. We could > assemble a list of these parts and then blacklist them in > should_show_attachment(). I went down that route yesterday but hit a snag: how do I associate the attachments in insert_html_markup() with the attachments in should_show_attachment()? The attachments in the former are identified by Content-ID, which aren't available in should_show_attachment(). Filenames, file sizes, etc. are unreliable means of identification. > But while we're thinking about this, I'll point out another problem that I've > noticed: Images that are linked in a HTML part with Content-Disposition: > inline show up twice -- once in the HTML and again with our inlining code. We > can't fix this in the same way -- we're adding them in earlier, in > RFC822.Message.get_body(). But if we kept track of the Content-ID, we could > exclude them in a similar fashion. Gah, I didn't catch that in my testing. This elevates the need for Content-ID in Geary.Attachment. Maybe we should just bite the bullet on this one. Let me take a look.
(In reply to comment #13) > Also, the current upgrade code has the nice advantage of only rebuilding > attachments if it detects some or all are missing from a message. I didn't appreciate this before. > > But while we're thinking about this, I'll point out another problem that I've > > noticed: Images that are linked in a HTML part with Content-Disposition: > > inline show up twice -- once in the HTML and again with our inlining code. We > > can't fix this in the same way -- we're adding them in earlier, in > > RFC822.Message.get_body(). But if we kept track of the Content-ID, we could > > exclude them in a similar fashion. > > Gah, I didn't catch that in my testing. This elevates the need for Content-ID > in Geary.Attachment. Maybe we should just bite the bullet on this one. Let me > take a look. I wasn't clear on this point, but this is not a new bug. I've noticed it for a while now, but never quite worked out what was going on. Figuring out the problem here clued me in to what was going on there. Given all this, I don't see any reason to hold off on the existing fix. Then we can make a big Content-ID bug to take care of these other problems.
I bit the bullet and did the Content-ID upgrade. It really wasn't as big a code change as I expected, at least not from the previous work on the branch. Prior warnings about using this branch continue to exist. These changes mean: * Content-ID (and Content-Description) are now stored in attachment table. * Database upgrade is a full rebuild of all attachments, not a partial. * Conversation viewer now stores a hash table of Content-IDs as they are displayed inline, which then excludes them from the attachment list. * We could consider using the Content-Description as the tooltip for the attachment icons, but not vital right now. * If you have any other suggestions for attachment metadata we should store in the database, now is the time to suggest them. * I don't have an available repro case for the inline image bug you mentioned. Can you test to see if this fixes it? If not, is it something easily fixed now w/ Content-ID available?
(In reply to comment #15) > * I don't have an available repro case for the inline image bug you mentioned. > Can you test to see if this fixes it? If not, is it something easily fixed now > w/ Content-ID available? This doesn't fix this problem, but it should make a fix possible. I don't see a clever solution at the moment. What we need to do is exclude images referenced by their CID from being included in the body by inline_image_replacer. But we don't know which images are references until we have the full HTML, which we only have after inline_image_replacer has run. I think we're going to have to come back after the CID replacement has occurred and remove or hide those inlined images that the inliner put in. If we expand the inline_image_replacer to include the CID in the inserted element, this wouldn't be too bad. I can give this a go if you like.
I'd certainly appreciate it if you could.
I've just pushed a commit to the WIP branch that solves the problem for me. I wasn't quite sure where to store the Content-ID for inlined images. I worried about polluting the class list, and I don't know that the CID value will qlways be a valid class name. Using the id was a possibility, but I worried that forwarding could result in multiple images with the same CID. This probably wouldn't be a problem, but I decided to introduce my own `cid` attribute for this purpose. If that bothers you, we can switch to a class or id mechanism.
Should the content_id string be escaped before being injected into the DOM? That's the only concern I have with this approach. Otherwise, I'm good with this.
We shouldn't need to do any escaping, since the CID should already have the same escaping as URLs. But it's entirely possible that we'll get an invalid CID with a quotation mark or something in it that will screw us up. Escaping or encoding the CID won't hurt anything.
I've seen some pretty goofy Message-IDs out there, I wouldn't trust Content-IDs to always be right either. I pushed a patch that encodes the CID's. I also noticed in one of my test emails a GLib warning about null pointer; I fixed that as well in the same branch. I *think* we have a complete branch here, but could use a reality check. Do you think there's anything else I/we need to address before merging into master?
This seems to be working. Two minor changes to consider: 1) We might add a comment to the removal code explaining that we don't need to escape the CID there. That it worked without this surprised me initially. 2) I had guessed that query_selector would throw an error if it couldn't find that element, but apparently it actually returns a null. As a result, we should add a debug or warning to the catch statement. Alternatively, we could remove that try / catch block and rely on the outer block. I can't think of anything else we need to do with the content ID, but the fact that it's going into the database means that we can make future fixes without a database version bump.
Robert, can you push these changes? I would appreciate it. Both seem like a good idea. > As a result, we > should add a debug or warning to the catch statement. Alternatively, we could > remove that try / catch block and rely on the outer block. Let's stick with try/catch block as-is and add the debug message as you suggested.
I got worried that it was working not because escaping wasn't needed in the search, but because there wasn't anything to escape in my test cases. Using the JS console, it looks like escaping is needed both on the insertion and removal. So I've added that, as well as adding a debug statement in the try/catch block.
Thanks, Robert. I've been running this branch for a day or so and want to run it over the weekend (in the office and at home). Since it's such a painful upgrade, I want to make sure we've not exposed ourselves to anything else before committing to master. If all goes well, I'll probably merge this early next week. If you find anything before then, please let me know.
Pushed to master, commit a9db6e8c. Also see commits 787d572, 64fdb80, and 8ab89a.
Thanks for working on this! After upgrading from 2bdd67c to 8ab89a6, I see from the logs that geary re-syncs my three three accounts. However, I get the following error whenever I try to open a message: >>>> [deb] 16:17:03 0,000369 imap-engine-email-prefetcher.vala:238: Error prefetching 1 emails for Other:mhu@es.aau.dk:INBOX: (Statement.ctor /home/hundeboll/.local/share/geary/mhu@es.aau.dk/geary.db) [err=1] - no such column: content_id ( SELECT id, filename, mime_type, filesize, disposition, content_id, description FROM MessageAttachmentTable WHERE message_id = ? ORDER BY id ) <<<< When I get home to a cabled internet connection, I will try to move my databases from .local/share/geary/*/geary.db to see if a complete rebuild fixes this.
The new .sql file was not being installed, causing this problem. I just pushed a patch that resolves this issue. Please pull from master and try again.