Even slightly further off topic, but probably useful in this context. Another top tip in git is that you can use `--color-moved=dimmed-zebra` to get a sense of code that was just moved but unchanged. More importantly, it directs you to code that differs in the midst of a move.
I used it yesterday to spot where a refactor that was moving code around lost changes during a rebase on master.
I just worked on a new diff algorithm for VS Code, in particular for the new merge editor (you can turn it on with "diffEditor.diffAlgorithm": "experimental").
The space-efficient Myers diff algorithm has some problems when used for merging, as tiny changes at the beginning of a file can cause a different path selection later on. Also, Myers doesn't minimize the number of hunks (only their total length), which sometimes leads to unnatural outcomes.
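For context, here's a minimal sketch (not VS Code's implementation) of the greedy forward pass of Myers' O(ND) algorithm. It computes the length of the shortest edit script (insertions plus deletions only), which illustrates the point: the algorithm optimizes total edit length and says nothing about how many hunks those edits are grouped into.

```python
def myers_ses_length(a, b):
    """Length of the shortest edit script (insertions + deletions)
    between sequences a and b, via Myers' greedy O(ND) forward search."""
    n, m = len(a), len(b)
    max_d = n + m
    # v[k] = furthest x reached on diagonal k (k = x - y)
    v = {1: 0}
    for d in range(max_d + 1):
        for k in range(-d, d + 1, 2):
            if k == -d or (k != d and v.get(k - 1, 0) < v.get(k + 1, 0)):
                x = v.get(k + 1, 0)      # move down  (take an insertion)
            else:
                x = v.get(k - 1, 0) + 1  # move right (take a deletion)
            y = x - k
            # follow the "snake": free diagonal moves over matching elements
            while x < n and y < m and a[x] == b[y]:
                x += 1
                y += 1
            v[k] = x
            if x >= n and y >= m:
                return d
    return max_d
```

On the example from Myers' paper, `myers_ses_length("abcabba", "cbabac")` returns 5: the minimal script length, regardless of how those five edits cluster into hunks.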
If anybody has good ideas for more advanced diffing algorithms, I'm all ears! (maybe even tree based ones, like difftastic)
It's already very difficult to objectively evaluate the quality of a diff, or to verify that improving the diff for a special case doesn't make the average case worse.
Given there is no "truth" (i.e. diff algorithms don't reproduce what you actually did, just how to transform one thing into the other), what is the goal you have for your diff algorithm?
Size of diff?
Speed of diff?
etc
That would be helpful in trying to give you alternative sources :)
The new algorithm better represents the change that the coder actually made.
More generally, I'd say that the point of diffing in something like VSCode is to aid the reviewer (whatever kind of reviewer that is - someone evaluating a PR, someone resolving merge conflicts, etc.) in understanding a change.
It's a tool for human understanding, and from that point of view, text diffing is kinda barbaric (don't get me wrong, I love it and use it all day long). You can rename a variable using an LSP, knowing that it's effectively a safe no-op on the code. But when it comes time to review the PR, that change is just additional noise mixed in with everything else.
I use Neovim, where I have treesitter. I look at the amazing information in there and long for the day when I'll be able to use its power for reviews. Something like I described above - show me a version of the change with the stuff that doesn't matter, like renames, stripped out. Then show me a diff that's the inverse of that, so I can skim it. I mentioned in another comment that `--color-moved=dimmed-zebra` is a useful tool to help focus on what matters.
It is a very interesting idea to decompose a diff into two diffs - one part that a computer understands (like renames, formatting changes), and one part that a computer does not understand, but that is as small as possible!
In case of the diff viewer in VS Code, the goal is to present the changes to the user in a helpful way, so that the user quickly understands what changed.
In case of the merge editor, the diffs should reflect the actual edits as closely as possible, so that non-conflicting edits cause non-conflicting diffs, which then allows a non-conflicting merge.
Nice animation. Diff is a very helpful tool, as is the patch command working on diff's output. It took me some time as a student (a long time ago) to really comprehend what they were doing, as it looked like magic when I first encountered them while learning Unix.
As an emacs user, I prefer its builtin ediff feature, which can work like an interactive kind of patch:
DonHopkins on April 12, 2020 | on: Enemy AI: chasing a player without Navigation2D or...
James Gosling's Emacs screen redisplay algorithm also used similar "dynamic programming techniques" to compute the minimal cost path through a cost matrix of string edit operations (the costs depended, e.g., on the number of characters to draw, the length of the escape codes to insert/delete lines/characters, padding for slow terminals, etc).
https://en.wikipedia.org/wiki/Gosling_Emacs
>Gosling Emacs was especially noteworthy because of the effective redisplay code, which used a dynamic programming technique to solve the classical string-to-string correction problem. The algorithm was quite sophisticated; that section of the source was headed by a skull-and-crossbones in ASCII art, warning any would-be improver that even if they thought they understood how the display code worked, they probably did not.
/* 1 2 3 4 .... Each Mij represents the minumum cost of
+---+---+---+---+----- rearranging the first i lines to map onto
1 | | | | | the first j lines (the j direction
+---+---+---+---+----- represents the desired contents of a line,
2 | | \| ^ | | i the current contents). The algorithm
+---+---\-|-+---+----- used is a dynamic programming one, where
3 | | <-+Mij| | M[i,j] = min( M[i-1,j],
+---+---+---+---+----- M[i,j-1]+redraw cost for j,2
4 | | | | | M[i-1,j-1]+the cost of
+---+---+---+---+----- converting line i to line j);
. | | | | | Line i can be converted to line j by either
. just drawing j, or if they match, by moving
. line i to line j (with insert/delete line)
*/
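The recurrence in that comment is easy to play with. Here's a toy Python version of it - with an invented cost model (redrawing a line costs its length, dropping an old line is free), whereas Gosling's real costs came from terminal escape-code lengths and padding:

```python
def redisplay_cost(old, new, redraw):
    """Toy version of the M[i,j] recurrence from Gosling's redisplay comment:
    minimum cost of rearranging the first i old lines into the first j new
    lines. `redraw(line)` is the cost of drawing a line from scratch."""
    n, m = len(old), len(new)
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):
        M[0][j] = M[0][j - 1] + redraw(new[j - 1])  # draw every new line
    for i in range(1, n + 1):
        M[i][0] = M[i - 1][0]  # discarding old lines is free in this toy model
        for j in range(1, m + 1):
            # converting old line i to new line j: free if they match
            # (reuse via insert/delete line), otherwise redraw j
            convert = 0 if old[i - 1] == new[j - 1] else redraw(new[j - 1])
            M[i][j] = min(M[i - 1][j],                     # drop old line i
                          M[i][j - 1] + redraw(new[j - 1]),  # draw new line j
                          M[i - 1][j - 1] + convert)         # convert i -> j
    return M[n][m]
```

For example, `redisplay_cost(["a", "b", "c"], ["b", "c", "d"], len)` is 1: drop `"a"`, reuse `"b"` and `"c"` by scrolling, and draw only `"d"`.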
This paper presents an algorithm for updating the image displayed on a conventional video terminal. It assumes that the terminal is capable of doing the usual insert/delete line and insert/delete character operations. It takes as input a description of the image currently on the screen and a description of the new image desired and produces a series of operations to do the desired transformation in a near-optimal manner. The algorithm is interesting because it applies results from the theoretical string-to-string correction problem (a generalization of the problem of finding a longest common subsequence), to a problem that is usually approached with crude ad-hoc techniques.
[...]
6. Conclusion
The redisplay algorithm described in this paper is used in an Emacs-like editor for Unix and a structure editor. Its performance has been quite good: to redraw everything on the screen (when everything has changed) takes about 0.12 seconds CPU time on a VAX 11/780 running Unix. Using the standard file typing program, about 0.025 seconds of CPU time are needed to type one screenful of text. Emacs averages about 0.004 CPU seconds per keystroke (with one call on the redisplay per keystroke).
Although in the interests of efficiency we have stripped down algorithm 5 to algorithm 6, the result is still an algorithm which has a firm theoretical basis and which is superior to the usual ad-hoc approach.
carapace on April 12, 2020
I wonder how the ratio of display overhead to cost to compute the minimal cost path has changed over the years? At some point it must be cheaper to do a "dumb" redraw if the display hardware is fast enough?
Also, it seems like this might also be useful for computing minimal diffs, eh?
DonHopkins on April 12, 2020
It was definitely worth it if you had a 300 baud modem and a lightly loaded VAX mostly to yourself. But it became overkill with higher baud rates and high speed networks.
It's based on the "string-to-string correction problem" (edit distance / Levenshtein distance), combined with "dynamic programming" (finding the best cost path through a cost matrix), which is (kind of) how diff works, on a line-by-line basis for inserted and deleted lines, and then on a character-by-character basis between changed lines. Each "edit" has a "cost" determined by the number of characters you have to send to the terminal (length of the escape codes to perform an edit in multiple possible ways, versus just redrawing the characters). Emacs computes its way through a twisty maze of little escape codes each time it updates the screen.
R. A. Wagner and M. J. Fischer. "The string-to-string correction problem." JACM 21, 1 (January 1974), 168-173.
>In computer science, the string-to-string correction problem refers to determining the minimum number of edit operations necessary to change one string into another (i.e., computing the shortest edit distance). A single edit operation may be changing a single symbol of the string into another, deleting, or inserting a symbol. The length of the edit sequence provides a measure of the distance between the two strings.
>In computing, the diff utility is a data comparison tool that calculates and displays the differences between two files. Unlike edit distance notions used for other purposes, diff is line-oriented rather than character-oriented, but it is like Levenshtein distance in that it tries to determine the smallest set of deletions and insertions to create one file from the other. The diff command displays the changes made in a standard format, such that both humans and machines can understand the changes and apply them: given one file and the changes, the other file can be created.
>In information theory, linguistics and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other. It is named after the Soviet mathematician Vladimir Levenshtein, who considered this distance in 1965.
>In computational linguistics and computer science, edit distance is a way of quantifying how dissimilar two strings (e.g., words) are to one another by counting the minimum number of operations required to transform one string into the other. Edit distances find applications in natural language processing, where automatic spelling correction can determine candidate corrections for a misspelled word by selecting words from a dictionary that have a low distance to the word in question. In bioinformatics, it can be used to quantify the similarity of DNA sequences, which can be viewed as strings of the letters A, C, G and T.
Here's a Levenshtein demo that shows you the cost matrix with the shortest path highlighted. Try calculating the distance between "fotomat" and "tomato", and you can see the shortest path which has a cost of 3 edits (1: delete f at beginning, 2: delete o at beginning, 3: insert o at end):
>Levenshtein distance is obtained by finding the cheapest way to transform one string into another. Transformations are the one-step operations of (single-phone) insertion, deletion and substitution. In the simplest versions substitutions cost two units except when the source and target are identical, in which case the cost is zero. Insertions and deletions costs half that of substitutions. This demonstration illustrates a simple algorithm which basically looks at all of the different ways for operations to transform one string to another.
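For the plain unit-cost variant (every insertion, deletion, or substitution costs 1), the cost matrix is easy to build yourself. Note the demo quoted above weights substitutions at double the cost of insertions/deletions, but for "fotomat" → "tomato" the cheapest path uses only insertions and deletions, so both cost models agree on a distance of 3:

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance via the classic DP cost matrix.
    M[i][j] = minimum number of edits turning a[:i] into b[:j]."""
    n, m = len(a), len(b)
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        M[i][0] = i                              # delete all of a[:i]
    for j in range(m + 1):
        M[0][j] = j                              # insert all of b[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            M[i][j] = min(M[i - 1][j] + 1,       # delete a[i-1]
                          M[i][j - 1] + 1,       # insert b[j-1]
                          M[i - 1][j - 1] + sub) # match or substitute
    return M[n][m]
```

`levenshtein("fotomat", "tomato")` returns 3, matching the shortest path described above (delete `f`, delete `o`, insert `o` at the end).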
1. You can change which algorithm `git diff` uses, as multiple are supported (`myers`, `minimal`, `patience`, `histogram`) via the `--diff-algorithm` option.
https://luppeng.wordpress.com/2020/10/10/when-to-use-each-of...
2. Google has an edit graph implementation in Go in the cmp package
https://github.com/google/go-cmp/blob/master/cmp/internal/di...