Bug 587410 - Bad diff highlighting
Status: RESOLVED FIXED
Product: meld
Classification: Other
Component: filediff
Version: 1.3.x
OS: Other Linux
Importance: Normal normal
Target Milestone: ---
Assigned To: Stephen Kennedy
QA Contact: Stephen Kennedy
Duplicates: 633161, 654283
Depends on:
Blocks:
 
 
Reported: 2009-06-30 10:04 UTC by Ross Burton
Modified: 2012-11-09 21:35 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
1-a (339 bytes, application/octet-stream)
    2009-06-30 10:04 UTC, Ross Burton
1-b (459 bytes, application/octet-stream)
    2009-06-30 10:04 UTC, Ross Burton
First pass patch (2.34 KB, patch)
    2009-07-01 22:12 UTC, Stephen Kennedy
Full patch (4.64 KB, patch)
    2009-09-21 12:24 UTC, Piotr Piastucki (reviewed)
an example of the highlighting bug on a very simple set of data (22.97 KB, application/zip)
    2010-03-05 12:52 UTC, lokycqnqapbrxr

Description Ross Burton 2009-06-30 10:04:04 UTC
Attached are two files.  Do a diff with 1-a on the left and 1-b on the right, and notice the bad highlighting of const gchar *tzid on the last line.
Comment 1 Ross Burton 2009-06-30 10:04:27 UTC
Created attachment 137607 [details]
1-a
Comment 2 Ross Burton 2009-06-30 10:04:39 UTC
Created attachment 137608 [details]
1-b
Comment 3 Piotr Piastucki 2009-07-01 19:20:55 UTC
Actually, the issue is caused by a performance optimization in Python's difflib.SequenceMatcher class that cannot be switched off.
If a chunk is big (>= 200 elements), the most popular elements (characters, in this case) are removed from the set of elements considered when searching for the longest common sequence. Unfortunately, this means that in a 200-character chunk, only characters that appear no more than twice are considered non-popular and used to determine the longest sequence (!). One wonders why people still like to hardcode such magic numbers.

Possible workaround: split chunks into lines and process them separately. This will only work for lines shorter than 200 characters, though.

See the following excerpt from SequenceMatcher for more details (b2j stores the indices used to perform the longest-sequence search):

    def __chain_b(self):
        b = self.b
        n = len(b)
        self.b2j = b2j = {}   # maps each element to the list of its indices in b
        populardict = {}
        for i, elt in enumerate(b):
            if elt in b2j:
                indices = b2j[elt]
                # Elements occurring in more than 1% of a 200+ element
                # sequence are flagged "popular" and their indices dropped.
                if n >= 200 and len(indices) * 100 > n:
                    populardict[elt] = 1
                    del indices[:]
                else:
                    indices.append(i)
            else:
                b2j[elt] = [i]

        # Purge leftover indices for popular elements.
        for elt in populardict:
            del b2j[elt]
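
A minimal, self-contained illustration of the heuristic (the strings are contrived for the example; later Python versions, 2.7.1/3.2 onwards, added an autojunk parameter that disables it, which did not exist when this bug was filed):

    import difflib

    # In a sequence of 200+ characters, any character occurring in more than
    # 1% of positions is treated as "popular" and ignored by the matcher.
    common = "ab" * 100          # 200 shared characters, all "popular"
    left = common + "X"
    right = common + "Y"

    sm = difflib.SequenceMatcher(None, left, right)
    print(sm.ratio())            # 0.0: the shared prefix is never matched

    # With the heuristic disabled, the 200-character common prefix is found.
    sm = difflib.SequenceMatcher(None, left, right, autojunk=False)
    print(sm.ratio())            # ~0.995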
Comment 4 Stephen Kennedy 2009-07-01 21:53:41 UTC
One thing I always wanted to try was to tokenize both blocks before running the inline diff to get the coarse changes, then run a diff again on the mismatched tokens for the fine-grained changes. This would push the complexity ceiling up somewhat and may be better suited to changes in program text.
Comment 5 Stephen Kennedy 2009-07-01 22:12:21 UTC
Created attachment 137708 [details] [review]
First pass patch

After that comment, I did a quick test to see what a simple regex split would do in this case. The initial results aren't too bad.
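
For illustration, a rough sketch of what such a regex split produces (the sample line is made up, not taken from the attachments):

    import re

    # Split on single non-word characters, keeping the separators, so the
    # coarse diff pass operates on words rather than characters.
    line = "const gchar *tzid = icaltimezone_get_tzid (zone);"
    tokens = [t for t in re.split(r"(\W)", line) if t]
    print(tokens)
    # ['const', ' ', 'gchar', ' ', '*', 'tzid', ' ', '=', ' ',
    #  'icaltimezone_get_tzid', ' ', '(', 'zone', ')', ';']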
Comment 6 Kai Willadsen 2009-07-02 06:51:31 UTC
Splitting into words prior to doing the character-by-character comparison sounds like a good idea. I think that tokenizing might actually *speed up* highlighting on average, as we're going to get fewer overly long sequences to be diffed.

Testing out the patch, the results look good. Should the regexes be r"(?u)(\W)" so that e.g., café doesn't get split into two tokens? Obviously an edge case...
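
As an aside, the edge case can be reproduced with byte strings, which mimic Python 2's pre-Unicode \W behaviour (a contrived illustration, not code from the patch):

    import re

    # Without Unicode semantics, \W matches the bytes of an accented letter,
    # so "café" gets broken apart; (?u), implicit for str in Python 3,
    # keeps it as a single token.
    print(re.split(rb"(\W)", "café".encode("utf-8")))       # [b'caf', b'\xc3', b'', b'\xa9', b'']
    print([t for t in re.split(r"(?u)(\W)", "café") if t])  # ['café']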
Comment 7 Piotr Piastucki 2009-07-02 19:52:04 UTC
I have tested the patch a bit, and it works really well indeed.
Comment 8 Stephen Kennedy 2009-07-11 12:39:43 UTC
Yes, the regex should be r"(?u)(\W)".

One other thing needed is to recurse into the mismatched blocks to get the detailed inline changes. This patch only gets the coarse word-level changes.
Comment 9 Piotr Piastucki 2009-09-21 12:24:08 UTC
Created attachment 143589 [details] [review]
Full patch

I am attaching a full patch that contains both coarse- and fine-grained passes. The patch is based on the most recent code and takes advantage of the inline diff caching introduced in filediff.py.
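
In outline, the two-pass approach amounts to something like the following sketch (a simplification for illustration, not the patch itself; the caching layer is omitted):

    import re
    from difflib import SequenceMatcher

    # Simplified two-pass inline diff: a coarse pass over word tokens, then a
    # character-level pass inside each mismatched region, with offsets
    # translated back into the full strings.
    def two_pass_opcodes(text1, text2):
        split = lambda t: [tok for tok in re.split(r"(?u)(\W)", t) if tok]
        tokens1, tokens2 = split(text1), split(text2)
        result = []
        for tag, i1, i2, j1, j2 in SequenceMatcher(None, tokens1, tokens2).get_opcodes():
            # Character offsets of this token range in the original strings.
            s1 = len("".join(tokens1[:i1]))
            s2 = len("".join(tokens2[:j1]))
            chunk1 = "".join(tokens1[i1:i2])
            chunk2 = "".join(tokens2[j1:j2])
            if tag == "replace":
                # Fine-grained pass: re-diff only the mismatched text.
                for ftag, f1, f2, f3, f4 in SequenceMatcher(None, chunk1, chunk2).get_opcodes():
                    result.append((ftag, s1 + f1, s1 + f2, s2 + f3, s2 + f4))
            else:
                result.append((tag, s1, s1 + len(chunk1), s2, s2 + len(chunk2)))
        return result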
Comment 10 Kai Willadsen 2009-09-30 23:56:52 UTC
Review of attachment 143589 [details] [review]:

(Trying out splinter for reviewing)

This patch seems to produce fewer bad edge case behaviours, but has at least one very common problem: the inline highlighting will pick and choose letters from a line to match a single word on the other side. More specific comments inline.

::: meld/filediff.py
@@ +79,3 @@
+    def __init__(self):
+        CachedSequenceMatcher.__init__(self)
+        self.fineMatcher = CachedSequenceMatcher()

I would have expected that fineMatcher is only likely to be run over short sequences. If that's the case, then caching may well be overkill.

Also, if we get rid of caching here, then there is no longer any need for the normal CachedSequenceMatcher, and the tokenising functionality can be rolled in.

@@ +90,3 @@
+                s1 = len("".join(tokens1[:o[1]]))
+                s2 = len("".join(tokens2[:o[3]]))
+                for fo in self.fineMatcher("".join(tokens1[o[1]:o[2]]), "".join(tokens2[o[3]:o[4]])):

I don't understand why these are being re-joined, particularly without a separator. I suspect that this is what is causing the odd inter-word matching.

@@ +95,3 @@
+        self.cache[(text1, textn)] = [opcodes, time.time()]
+        return opcodes
+

So this seems like it might be doing more (or possibly less in some cases) work than we used to. Do you have any idea whether it's significantly slower/faster?

@@ +710,3 @@
 
+                    text1 = "\n".join(alltexts[1][c[1]:c[2]] )
+                    textn = "\n".join(alltexts[i*2][c[3]:c[4]] )

I don't know what the previous encode/unpack was doing. I guess it's okay to remove, but it would be nice to know why it was there to start with. Stephen?
Comment 11 Piotr Piastucki 2009-10-01 06:16:09 UTC
>I don't understand why these are being re-joined, particularly without a separator. I suspect that this is what is causing the odd inter-word matching.

This is actually correct behaviour for a sequence matcher; it tries to find the longest common sequence, after all. :)
To make the highlighting look more "reasonable", meld uses the following condition:
if (o[2]-o[1] < 3) or (o[4]-o[3] < 3):

I cannot see any problem with forcing the matching sequences to be at least 3 letters long in the patch too.
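
As a hypothetical illustration (the helper name is invented, not meld's actual code), the rule amounts to:

    # Fold "equal" runs shorter than 3 characters back into the surrounding
    # change, so the highlight cannot pick out stray letters inside a word.
    def drop_short_equals(opcodes, min_len=3):
        result = []
        for tag, i1, i2, j1, j2 in opcodes:
            if tag == "equal" and (i2 - i1 < min_len or j2 - j1 < min_len):
                tag = "replace"
            if result and result[-1][0] == tag:
                # Merge with the previous block of the same kind.
                _, p1, _, p3, _ = result[-1]
                result[-1] = (tag, p1, i2, p3, j2)
            else:
                result.append((tag, i1, i2, j1, j2))
        return result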

I have not noticed any significant performance difference between the current code and the patched one. Sometimes it is a bit slower, sometimes a bit faster; the difference is usually within 10 ms per edit (adding a letter, deleting a letter, adding a line, deleting a line).
Comment 12 lokycqnqapbrxr 2010-03-05 12:52:00 UTC
Created attachment 155303 [details]
an example of the higlighting bug on a very simple set of data

Step 1: compare a.txt and b.txt; result: 1.png
Step 2: replace the 2's with dashes; result: 2.png
Step 3: delete any character from a.txt; result: 3.png
Comment 13 Kai Willadsen 2010-10-06 20:23:35 UTC
I've just pushed a change that moves to using MyersSequenceMatcher to do our inline highlighting. This seems to give significant improvements, and handles both of the test cases here reasonably.

However, we could still do a lot better. The files 1-a and 1-b above highlight relatively sanely, but there is still weird word-breaking behaviour evident. I'm leaving this open for the token-based diff patches and such.
Comment 14 Piotr Piastucki 2010-10-20 20:00:58 UTC
Please keep in mind that this change slows down meld significantly; e.g., run ./meld ../meld/filediff.py ../meld/meldapp.py to see the difference.
It may be a good idea to add a switch to turn inline highlighting on and off, or to choose which algorithm should be used (quick/accurate).
Moreover, the chunk size at which inline highlighting is automatically switched off may need to be decreased.
WinMerge, for instance, provides three modes of inline highlighting:
1) disabled
2) word-based (similar to Stephen's patch)
3) character-based
Comment 15 Kai Willadsen 2010-11-22 20:54:40 UTC
*** Bug 633161 has been marked as a duplicate of this bug. ***
Comment 16 Kai Willadsen 2011-07-10 21:10:41 UTC
*** Bug 654283 has been marked as a duplicate of this bug. ***
Comment 17 Piotr Piastucki 2012-11-09 09:32:11 UTC
I think this issue has already been fixed in meld 1.6. Please let me know if there is anything else that should be improved in the diff algorithm and I will take a look.
Comment 18 Kai Willadsen 2012-11-09 21:35:31 UTC
No, I think we're good here. The 1.6 performance was fine, and between your new k-mer matcher and the threading, 1.7 performance is looking really good. I'm closing the bug.