Lean diffs for browser-based text editors
TL;DR
Have a look at textdiff-create and textdiff-patch.
The story
As part of the documentation process for a project, I’m currently studying the internal mechanics of the online content publishing systems that power the web. Blogs, news portals and arguably even social networks all have one thing in common: some people produce and publish content for other people to read. And an important part of that content is still the almighty text.
Up until recently, writing and publishing were two distinct steps. News reporters, novelists and even some bloggers wrote in offline text editors or word processors before distributing their work online.
Technology has changed. Mobile data connections are omnipresent, at least in urban areas, portable devices have become affordable, and modern web technologies allow editing large blocks of rich text directly in the browser. Many people have stopped using Word in favour of Google Docs. Writing articles like this one on coderwall is another good example.
There are also excellent open-source solutions any skilled developer can integrate into their web applications. We’re way past tormenting users with plain and awkward <textarea> tags now. Ace Editor and CodeMirror are perfectly usable with files holding hundreds of lines of code. ProseMirror will soon enable users to easily edit entire ebooks in the browser, even on mobile devices.
But there’s another technical challenge developers face when building products such as Medium, where users can input virtually unlimited text: content persistence. In other words, you have to keep that text somewhere reliable, which unfortunately excludes localStorage.
But sending the entire ebook to the server on each save or autosave is impractical, even over HTTP/2, and especially from applications running on mobile devices.
So, what if, instead of sending the entire block of text to the server on each save, you submitted a “minimal patch” for each operation, containing just the relevant differences between the client-side version and the server-side version?
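There are many text diff tools in the Node.js ecosystem, and most of them work in basically the same way: given two versions of a text, they return a list of parts, each carrying a chunk of text marked as deleted, inserted or unchanged.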
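To make that output shape concrete, here is a minimal sketch. It assumes a tuple format of [operation, text] with -1 for deleted, 0 for unchanged and 1 for inserted parts, and it uses a hypothetical verboseDiff helper written only for this illustration. The helper is deliberately naive (it just trims the common prefix and suffix) and is not how any particular library computes its diffs.

```javascript
// Hypothetical operation codes, assumed for this sketch only.
const DELETE = -1;
const EQUAL = 0;
const INSERT = 1;

// Hypothetical helper, not taken from any library: a naive diff that only
// trims the common prefix and suffix of the two strings. Real diff tools use
// smarter algorithms, but their output carries the same kind of information.
function verboseDiff(before, after) {
  // Length of the shared prefix.
  let start = 0;
  const maxStart = Math.min(before.length, after.length);
  while (start < maxStart && before[start] === after[start]) start++;

  // Length of the shared suffix, never overlapping the prefix.
  let end = 0;
  const maxEnd = maxStart - start;
  while (
    end < maxEnd &&
    before[before.length - 1 - end] === after[after.length - 1 - end]
  ) {
    end++;
  }

  const parts = [
    [EQUAL, before.slice(0, start)],
    [DELETE, before.slice(start, before.length - end)],
    [INSERT, after.slice(start, after.length - end)],
    [EQUAL, before.slice(before.length - end)],
  ];

  // Drop parts that carry no text.
  return parts.filter(([, text]) => text.length > 0);
}

const diff = verboseDiff('The sleepy brown fox', 'The quick brown fox');
console.log(JSON.stringify(diff));
// prints [[0,"The "],[-1,"sleepy"],[1,"quick"],[0," brown fox"]]
```

Every part carries its full text here, including the deleted and the unchanged ones.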
Such tools generate comprehensive but redundant information: they track the content that has been deleted, the new content and the parts that remain unchanged. But for the purpose of optimising the data sent to the server, we’d ideally want an out-of-the-box solution that generates lean output, free of redundant information. The server already holds the deleted and unchanged text, so we can discard those parts and only keep track of their lengths.
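The reduction can be sketched in a few lines. The tuple format below ([operation, text] on input, [operation, length] for deleted and unchanged parts on output) and the toLeanDelta name are assumptions made for this illustration; they are not presented as the exact format of any published package.

```javascript
// Hypothetical operation codes, assumed for this sketch only.
const DELETE = -1;
const EQUAL = 0;
const INSERT = 1;

// Hypothetical helper: turn verbose [operation, text] parts into a lean
// delta. Inserted text is kept as-is; deleted and unchanged text is replaced
// by its length, because the receiver already holds that text.
function toLeanDelta(parts) {
  return parts.map(([op, text]) =>
    op === INSERT ? [op, text] : [op, text.length]
  );
}

const verbose = [
  [EQUAL, 'The '],
  [DELETE, 'sleepy'],
  [INSERT, 'quick'],
  [EQUAL, ' brown fox'],
];

const lean = toLeanDelta(verbose);
console.log(JSON.stringify(lean));
// prints [[0,4],[-1,6],[1,"quick"],[0,10]]
```

The saving grows with the size of the document: a one-word edit in an entire ebook still yields a delta of only a few tuples, while the verbose form would carry almost the whole book.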
With that in mind, I’ve built a simple two-part solution and released it under the liberal ISC license: textdiff-create and textdiff-patch. Typically, you’d use textdiff-create in your client application to create delta patches, which you then apply to the original content on the server with textdiff-patch.
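The server-side half of that round trip can be sketched as follows. This is not the source of textdiff-patch: the applyLeanDelta function and the delta format ([0, length] for unchanged, [-1, length] for deleted, [1, text] for inserted) are assumptions made for illustration, so check the packages’ own documentation for their actual format and behaviour.

```javascript
// Hypothetical operation codes, assumed for this sketch only.
const DELETE = -1;
const EQUAL = 0;
const INSERT = 1;

// Hypothetical helper: rebuild the new text from the original text and a
// lean delta, walking the original with a cursor.
function applyLeanDelta(original, delta) {
  let cursor = 0;
  let result = '';

  for (const [op, payload] of delta) {
    if (op === EQUAL) {
      // Copy `payload` characters from the original.
      result += original.slice(cursor, cursor + payload);
      cursor += payload;
    } else if (op === DELETE) {
      // Skip `payload` characters of the original.
      cursor += payload;
    } else if (op === INSERT) {
      // Append the new text carried by the delta.
      result += payload;
    } else {
      throw new Error(`Unknown operation: ${op}`);
    }
  }

  // A delta built against another version will not line up with this text.
  if (cursor !== original.length) {
    throw new Error('Delta does not match the length of the original text');
  }
  return result;
}

const delta = [[EQUAL, 4], [DELETE, 6], [INSERT, 'quick'], [EQUAL, 10]];
console.log(applyLeanDelta('The sleepy brown fox', delta));
// prints The quick brown fox
```

The final length check is a design choice of this sketch: it rejects a patch that was created against a different version of the text instead of silently producing a corrupted document.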
Feel free to have a look at the source code on GitHub, use it and abuse it in your own projects, and don’t hesitate to contribute with PRs if there’s something you would like to change.
This story was originally published on Medium.