Background
Recently, I received a requirement: display a diff with 2 JSON strings, which are basically an API's request and response data. Furthermore, we should show the diff in a Git-like style.
Unfortunately, after a brief exploration, I found no suitable library for me.
It seems I should implement it by myself.
A "Git" Way
The JSON string can be pretty informal. For example, these 2 JSON strings are equal:
{"a":1,"b":2}
{ "a" :1, "b": 2}
Luckily, a powerful built-in function, JSON.stringify
, can serialise a JavaScript object to a standard JSON string. Moreover, it can also format the string with the given indent level:
const original = `{ "a" :1, "b": 2}`;
const formatted = JSON.stringify(JSON.parse(original), null, 4);
console.log(formatted);
// ...which will result in:
// {
// "a": 1,
// "b": 2
// }
The result looks pretty much like the lower-right item in the screenshot before. Let's find a proper way to perform the diff.
Since what I need to display is a Git-like style, how about using the diff algorithm used by Git directly?
Git uses the Myers algorithm to generate a beautiful diff result; maybe I can use a Git library. But I found two cons for this method quickly:
First, it can not handle the trailing comma perfectly; I had seen many conflicts in package.json
when I started a merge request. In the above example, the a
line is recognised as "different" only because of its trailing comma.
{ | {
- "a": 1 |
| + "a": 1,
| + "b": 2
} | }
Second, JSON.stringify
can not handle the different key orders. For example:
// the results are different!
JSON.stringify(JSON.parse({"a":1,"b":2}))
JSON.stringify(JSON.parse({"b":2,"a":1}))
So the "Git" way failed; I should consider making a diff method myself.
A Basic JSON Diff Method
First of all, JSON has many types: number
, string
, boolean
, null
, object
, array
. If two values have different types, we can easily say they are different. There is no need to diff them recursively, just output "the left is removed, the right is added". Likewise, if they have the same primitive type, we can easily say they are different, just output "the left is removed, and the right is added".
Now we can only deal with two values with the same type object
or array
. Let's start with object
first.
Object Diff Algorithm
Assume we have two objects, left
and right
, a direct diff algorithm can be:
- Sort all keys for
left
andright
; assume the result iskeysLeft
andkeysRight
. - Use two pointers
p
andq
pointing atkeysLeft[0]
andkeysRight[0]
. - If there is no more value in
keysLeft
, output "the rest ofkeysRight
is added"; if there is no more value inkeysRight
, output "the rest ofkeysLeft
is removed". - Compare the two pointed values:
- If
p
is smaller,p++
, output "p
is removed", add a blank line forq
; - If
q
is smaller,q++
, output "q
is added", add a blank line forp
; If they are equal, increase them and...
- If the values are also equal, output "equal";
- If the values are not the same type, output "
p
is removed, andq
is added". - If the values are both objects, diff them recursively.
- Else (if the values are both arrays), use the array diff method (we will find it later).
Here comes an example:
[start]
p: ↓
keysLeft: a d f g i j k l
keysRight: a b c d i
q: ↑
[step 1]
p: ↓
keysLeft: a d f g i j k l
keysRight: a b c d i
q: ↑
[step 2]
p: ↓
keysLeft: a _ d f g i j k l
keysRight: a b c d i
q: ↑
[...various steps skipped]
[result]
p: ↓
keysLeft: a _ _ d f g i j k l
keysRight: a b c d _ _ i _ _ _
q: ↑
This algorithm looks just like the merging process in the "merge sort" algorithm. By addling empty lines, it can ensure the left and right results have the same number of lines.
Array Diff Algorithm
Compared with the objects, the arrays seem easier to handle. Assume we have two arrays, left
and right
:
- Let
p
andq
point toleft[0]
andright[0]
. - If there is no more value in
left
, output "the rest ofright
is added"; if there is no more value inright
, output "the rest ofleft
is added". If
p
andq
are different:- If
p
andq
are both objects, call object diff; - If
p
andq
are both arrays, diff them recursively; - Else, output "
p
is removed, andq
is added".
- If
- Else, output "equal".
The above algorithms should work for most cases. 🎉
Icing On The Cake
Although the algorithms can give a good result, the result may sometimes make us confusing.
Comply With Intuition
There may be multiple diff results for two JSONs:
- 1 | - 1 | | + 3
- 2 | | + 3 | + 4
| + 3 - 2 | | + 5
| + 4 | + 4 - 1 |
| + 5 | + 5 - 2 |
✅ 🤔 🤔
Obviously, the first one is the best. It complies with our intuition: 1) first "remove" then "add", 2) as many continuous lines with the same type as possible. That is what the output produced by the Mayers algorithm look like. Can we adjust our output to achieve this goal? I came out with a solution inspired by the "bubble sort" algorithm.
If we have a diff result, we might swap two lines to get a better result (in this example, we swap the line 2 and 3):
- 1 | - 1 |
| + 3 - 2 |
- 2 | => | + 3
| + 4 | + 4
| + 5 | + 5
If two adjacent lines, x
and y
, x
is [empty, added]
, y
is [removed, empty]
, we can swap them. The result is a [removed, empty]
followed by [empty, added]
.
The swap process may be executed several times. But like the bubble sort, if there is no swap in the current loop, the algorithm can be terminated early.
Better Matching Algorithm For Arrays
Sometimes, the difference between two arrays is due to the insertion or deletion of only a few items. For example, we should not tell "remove" in this situation:
- 2 | | + 1
- 3 | 2 | 2
| + 1 3 | 3
| + 2
| + 3
🤔 ✅
A dynamic programming algorithm called LCS (short for Longest Common Subsequence) can help us find the longest common subsequence for two arrays. For example, the LCS is [2, 3]
in the previous two arrays.
A fundamental way is we find the LCS, diff the LCS with two arrays. The extra items in a
should be identified as "remove", those in b
should be identified as "added".
But there is an easier way. During the LCS calculation, we use a backtrack
matrix to record current status ("left", "up", "diag"). When backtracking, we can tell that the current item in a
should be removed if the current status is "left" and that the current item in b
should be added if the current status is "up".
And then... Booyah!
Identifying Modifications
Git diff is a good way to display results, but we can do more with JSON strings. For example, if we receive {"a":1}
and {"a":2}
, we can tell that "the field a
is changed from 1 to 2".
Let's review the object part to improve the algorithm:
If
p
andq
are different:
- ...
- Output "
p
is removed, andq
is added".
The output should be "p
is modified to q
".
The array part is a bit more complex. If we use normal matching, just output "a[i]
is modified to b[i]
" is enough. But if we use LCS matching, we should change the backtracking process a little.
We can tell that the current item ina
should be removed if the current status is "left" and that the current item inb
should be added if the current status is "up".
How do we recognise "there is an a remove
followed by b add
" when backtracking? The solution is easy: when encountering an "up", we take a step forward and see if the next step (actually, it's the "previous" step since we are backtracking) is a "left". If a "left" followed by an "up" appears in a path, we can tell that "a[i]
is removed and b[j]
is added". Then it can be optimised to "a[i]
is modified to b[j]
".
So far, we have found an algorithm good enough to handle the diff for JSON strings. We can output these fields for each line:
- Type (add, modify, remove, equal);
- Line number;
- Indent level;
- Text;
- Whether to have a trailing comma.
The interface for diff results can be like this:
export interface DiffResult {
level: number;
type: 'modify' | 'add' | 'remove' | 'equal';
text: string;
comma?: boolean;
lineNumber?: number;
}
The Differ & The Viewer
To decouple the diff and display logic, I write a Differ
class to calculate the diff result and a Viewer
component to display the diff result. I package these logics to a library json-diff-kit (a more complex usage is here), developers can use this library easily:
import { Differ } from 'json-diff-kit';
const before = {
a: 1,
b: 2,
d: [1, 5, 4],
e: ['1', 2, { f: 3, g: null, h: [5], i: [] }, 9],
};
const after = {
b: 2,
c: 3,
d: [1, 3, 4, 6],
e: ['1', 2, 3, { f: 4, g: false, i: [7, 8] }, 10],
};
// all configs are optional
const differ = new Differ({
detectCircular: true, // default `true`
maxDepth: Infinity, // default `Infinity`
showModifications: true, // default `true`
arrayDiffMethod: 'lcs', // default `"normal"`
});
const diff = differ.diff(before, after);
import { Viewer } from 'json-diff-kit';
import type { DiffResult } from 'json-diff-kit';
import 'json-diff-kit/dist/viewer.css';
interface PageProps {
diff: [DiffResult[], DiffResult[]];
}
const Page: React.FC<PageProps> = props => {
return (
<Viewer
diff={props.diff} // required
indent={4} // default `2`
lineNumbers={true} // default `false`
/>
);
};
And the result is clear:
Conclusion
I combine (or am inspired by) several simple algorithms:
- Merging process in merge sort
- Bubble sort
- LCS & backtracking
- Recursion
These algorithms is enough to make a good diff result for two JSON strings.