Even slightly further off topic, but probably useful in this context. Another top tip in git is that you can use `--color-moved=dimmed-zebra` to get a sense of code that was just moved but unchanged. More importantly, it directs you to code that differs in the midst of a move.
I used it yesterday to spot where a refactor that was moving code around lost changes during a rebase on master.
I just worked on a new diff algorithm for VS Code, in particular for the new merge editor (you can turn it on with "diffEditor.diffAlgorithm": "experimental").
The space-efficient Myers diff algorithm has some problems when used for merging, as tiny changes at the beginning of a file can cause a different path selection later on. Also, Myers doesn't minimize the number of hunks (only their total length), which sometimes leads to unnatural outcomes.
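For context, here's a minimal sketch (not VS Code's implementation) of the greedy forward pass of Myers' O(ND) algorithm. It computes the length of the shortest edit script (insertions plus deletions only), which illustrates the point: the algorithm optimizes total edit length and says nothing about how many hunks those edits are grouped into.

```python
def myers_ses_length(a, b):
    """Length of the shortest edit script (insertions + deletions)
    between sequences a and b, via Myers' greedy O(ND) forward search."""
    n, m = len(a), len(b)
    max_d = n + m
    # v[k] = furthest x reached on diagonal k (k = x - y)
    v = {1: 0}
    for d in range(max_d + 1):
        for k in range(-d, d + 1, 2):
            if k == -d or (k != d and v.get(k - 1, 0) < v.get(k + 1, 0)):
                x = v.get(k + 1, 0)      # move down  (take an insertion)
            else:
                x = v.get(k - 1, 0) + 1  # move right (take a deletion)
            y = x - k
            # follow the "snake": free diagonal moves over matching elements
            while x < n and y < m and a[x] == b[y]:
                x += 1
                y += 1
            v[k] = x
            if x >= n and y >= m:
                return d
    return max_d
```

On the example from Myers' paper, `myers_ses_length("abcabba", "cbabac")` returns 5: the minimal script length, regardless of how those five edits cluster into hunks.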
If anybody has good ideas for more advanced diffing algorithms, I'm all ears! (maybe even tree based ones, like difftastic)
It's already very difficult to objectively evaluate the quality of a diff, or to verify that improving the diff for a special case doesn't make the average case worse.
Given there is no "truth" (i.e. diff algorithms don't reproduce what you actually did, just how to transform one thing into the other), what is the goal you have for your diff algorithm?
Size of diff?
Speed of diff?
etc
That would be helpful in trying to give you alternative sources :)
The new algorithm better represents the change that the coder actually made.
More generally, I'd say that the point of diffing in something like VSCode is to aid the reviewer (whatever kind of reviewer that is - someone evaluating a PR, someone resolving merge conflicts, etc.) in understanding a change.
It's a tool for human understanding, and from that point of view, text diffing is kinda barbaric (don't get me wrong, I love it and use it all day long). You can rename a variable using an LSP, knowing that it's effectively a safe no-op on the code. But when it comes time to review the PR, that change is just additional noise mixed in with everything else.
I use Neovim, where I have treesitter. I look at the amazing information in there and long for the day when I'll be able to use its power for reviews. Something like I described above - show me a version of the change with the stuff that doesn't matter, like renames, stripped out. Then show me a diff that's the inverse of that, so I can skim it. I mentioned in another comment that `--color-moved=dimmed-zebra` is a useful tool to help focus on what matters.
It is a very interesting idea to decompose a diff into two diffs - one part that a computer understands (like renames, formatting changes), and one part that a computer does not understand, but that is as small as possible!
In case of the diff viewer in VS Code, the goal is to present the changes to the user in a helpful way, so that the user quickly understands what changed.
In case of the merge editor, the diffs should reflect the actual edits as closely as possible, so that non-conflicting edits cause non-conflicting diffs, which then allows a non-conflicting merge.
Nice animation. Diff is a very helpful tool, as is the patch command working on diff's output. It took me some time as a student (a long time ago) to really comprehend what they were doing, as it looked like magic when I first encountered them while learning Unix.
As an emacs user, I prefer its builtin ediff feature, which can work like an interactive kind of patch:
DonHopkins on April 12, 2020 | on: Enemy AI: chasing a player without Navigation2D or...
James Gosling's Emacs screen redisplay algorithm also used similar "dynamic programming techniques" to compute the minimal cost path through a cost matrix of string edit operations (the costs depended, e.g., on the number of characters to draw, the length of the escape codes to insert/delete lines/characters, padding for slow terminals, etc).
https://en.wikipedia.org/wiki/Gosling_Emacs
>Gosling Emacs was especially noteworthy because of the effective redisplay code, which used a dynamic programming technique to solve the classical string-to-string correction problem. The algorithm was quite sophisticated; that section of the source was headed by a skull-and-crossbones in ASCII art, warning any would-be improver that even if they thought they understood how the display code worked, they probably did not.
/* 1 2 3 4 .... Each Mij represents the minumum cost of
+---+---+---+---+----- rearranging the first i lines to map onto
1 | | | | | the first j lines (the j direction
+---+---+---+---+----- represents the desired contents of a line,
2 | | \| ^ | | i the current contents). The algorithm
+---+---\-|-+---+----- used is a dynamic programming one, where
3 | | <-+Mij| | M[i,j] = min( M[i-1,j],
+---+---+---+---+----- M[i,j-1]+redraw cost for j,2
4 | | | | | M[i-1,j-1]+the cost of
+---+---+---+---+----- converting line i to line j);
. | | | | | Line i can be converted to line j by either
. just drawing j, or if they match, by moving
. line i to line j (with insert/delete line)
*/
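The recurrence in that comment is easy to play with. Here's a toy Python version of it - with an invented cost model (redrawing a line costs its length, dropping an old line is free), whereas Gosling's real costs came from terminal escape-code lengths and padding:

```python
def redisplay_cost(old, new, redraw):
    """Toy version of the M[i,j] recurrence from Gosling's redisplay comment:
    minimum cost of rearranging the first i old lines into the first j new
    lines. `redraw(line)` is the cost of drawing a line from scratch."""
    n, m = len(old), len(new)
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):
        M[0][j] = M[0][j - 1] + redraw(new[j - 1])  # draw every new line
    for i in range(1, n + 1):
        M[i][0] = M[i - 1][0]  # discarding old lines is free in this toy model
        for j in range(1, m + 1):
            # converting old line i to new line j: free if they match
            # (reuse via insert/delete line), otherwise redraw j
            convert = 0 if old[i - 1] == new[j - 1] else redraw(new[j - 1])
            M[i][j] = min(M[i - 1][j],                     # drop old line i
                          M[i][j - 1] + redraw(new[j - 1]),  # draw new line j
                          M[i - 1][j - 1] + convert)         # convert i -> j
    return M[n][m]
```

For example, `redisplay_cost(["a", "b", "c"], ["b", "c", "d"], len)` is 1: drop `"a"`, reuse `"b"` and `"c"` by scrolling, and draw only `"d"`.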
This paper presents an algorithm for updating the image displayed on a conventional video terminal. It assumes that the terminal is capable of doing the usual insert/delete line and insert/delete character operations. It takes as input a description of the image currently on the screen and a description of the new image desired and produces a series of operations to do the desired transformation in a near-optimal manner. The algorithm is interesting because it applies results from the theoretical string-to-string correction problem (a generalization of the problem of finding a longest common subsequence), to a problem that is usually approached with crude ad-hoc techniques.
[...]
6. Conclusion
The redisplay algorithm described in this paper is used in an Emacs-like editor for Unix and a structure editor. Its performance has been quite good: to redraw everything on the screen (when everything has changed) takes about 0.12 seconds CPU time on a VAX 11/780 running Unix. Using the standard file typing program, about 0.025 seconds of CPU time are needed to type one screenful of text. Emacs averages about 0.004 CPU seconds per keystroke (with one call on the redisplay per keystroke).
Although in the interests of efficiency we have stripped down algorithm 5 to algorithm 6, the result is still an algorithm which has a firm theoretical basis and which is superior to the usual ad-hoc approach.
carapace on April 12, 2020
I wonder how the ratio of display overhead to cost to compute the minimal cost path has changed over the years? At some point it must be cheaper to do a "dumb" redraw if the display hardware is fast enough?
Also, it seems like this might also be useful for computing minimal diffs, eh?
DonHopkins on April 12, 2020
It was definitely worth it if you had a 300 baud modem and a lightly loaded VAX mostly to yourself. But it became overkill with higher baud rates and high speed networks.
It's based on the "string-to-string correction problem" (edit distance / Levenshtein distance), combined with "dynamic programming" (finding the best cost path through a cost matrix), which is (kind of) how diff works, on a line-by-line basis for inserted and deleted lines, and then on a character-by-character basis between changed lines. Each "edit" has a "cost" determined by the number of characters you have to send to the terminal (length of the escape codes to perform an edit in multiple possible ways, versus just redrawing the characters). Emacs computes its way through a twisty maze of little escape codes each time it updates the screen.
R. A. Wagner and M. J. Fischer. "The string-to-string correction problem." JACM 21, 1 (January 1974), 168-173.
>In computer science, the string-to-string correction problem refers to determining the minimum number of edit operations necessary to change one string into another (i.e., computing the shortest edit distance). A single edit operation may be changing a single symbol of the string into another, deleting, or inserting a symbol. The length of the edit sequence provides a measure of the distance between the two strings.
>In computing, the diff utility is a data comparison tool that calculates and displays the differences between two files. Unlike edit distance notions used for other purposes, diff is line-oriented rather than character-oriented, but it is like Levenshtein distance in that it tries to determine the smallest set of deletions and insertions to create one file from the other. The diff command displays the changes made in a standard format, such that both humans and machines can understand the changes and apply them: given one file and the changes, the other file can be created.
>In information theory, linguistics and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other. It is named after the Soviet mathematician Vladimir Levenshtein, who considered this distance in 1965.
>In computational linguistics and computer science, edit distance is a way of quantifying how dissimilar two strings (e.g., words) are to one another by counting the minimum number of operations required to transform one string into the other. Edit distances find applications in natural language processing, where automatic spelling correction can determine candidate corrections for a misspelled word by selecting words from a dictionary that have a low distance to the word in question. In bioinformatics, it can be used to quantify the similarity of DNA sequences, which can be viewed as strings of the letters A, C, G and T.
Here's a Levenshtein demo that shows you the cost matrix with the shortest path highlighted. Try calculating the distance between "fotomat" and "tomato", and you can see the shortest path which has a cost of 3 edits (1: delete f at beginning, 2: delete o at beginning, 3: insert o at end):
>Levenshtein distance is obtained by finding the cheapest way to transform one string into another. Transformations are the one-step operations of (single-phone) insertion, deletion and substitution. In the simplest versions substitutions cost two units except when the source and target are identical, in which case the cost is zero. Insertions and deletions costs half that of substitutions. This demonstration illustrates a simple algorithm which basically looks at all of the different ways for operations to transform one string to another.
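For the plain unit-cost variant (every insertion, deletion, or substitution costs 1), the cost matrix is easy to build yourself. Note the demo quoted above weights substitutions at double the cost of insertions/deletions, but for "fotomat" → "tomato" the cheapest path uses only insertions and deletions, so both cost models agree on a distance of 3:

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance via the classic DP cost matrix.
    M[i][j] = minimum number of edits turning a[:i] into b[:j]."""
    n, m = len(a), len(b)
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        M[i][0] = i                              # delete all of a[:i]
    for j in range(m + 1):
        M[0][j] = j                              # insert all of b[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            M[i][j] = min(M[i - 1][j] + 1,       # delete a[i-1]
                          M[i][j - 1] + 1,       # insert b[j-1]
                          M[i - 1][j - 1] + sub) # match or substitute
    return M[n][m]
```

`levenshtein("fotomat", "tomato")` returns 3, matching the shortest path described above (delete `f`, delete `o`, insert `o` at the end).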
1. You can change which algorithm `git diff` uses, as multiple are supported (`myers`, `minimal`, `patience`, `histogram`) via the `--diff-algorithm` option.
https://luppeng.wordpress.com/2020/10/10/when-to-use-each-of...
2. Google has an edit graph implementation in Go in the cmp package
https://github.com/google/go-cmp/blob/master/cmp/internal/di...