Posts tagged with 'diff'

PCPlus 280: Writing a spellchecker

I write a monthly column for PCPlus, a computer news-views-n-reviews magazine in the UK (actually there are 13 issues a year — there’s an Xmas issue as well — so it’s a bit more than monthly). The column is called Theory Workshop and appears in the Make It section of the magazine. When I signed up, my editor and the magazine were gracious enough to allow me to reprint the articles here after say a year or so. What I’ll do is publish the article from a year ago or so here when I purchase the current issue.

PCPlus logoSince I picked up April 2010’s issue when I was in England just over a week ago, I’m publishing this article from April 2009 slightly earlier than I usually do (which means May 2009’s article will be posted in June or something).

The topic is pretty interesting: how does a spell-checker give you alternatives to a badly-spelled word? The first step — finding out if a word is misspelled — is simple enough as a basic algorithm: just look it up in a big ol’ list o’ words. The next step can seem daunting though: how can you generate a valid list of possible corrections? The article talks about a few possibilities: the Levenshtein distance, the Damerau algorithm, and the Soundex method.

The Levenshtein distance is a bit impractical for a large list of words, although there are certain techniques you can use to speed things up. It essentially calculates the edit distance (the number of character insertions and deletions to go from one word to another) of the misspelled word and every other word in the list. The words with the smallest edit distance are your correction candidates. As I said, possibly too slow for a large dictionary, although it’s a great algorithm for creating diff tools.

The Damerau algorithm essentially calculates all the words that can be formed form the misspelled word by a single mistyped letter and checks those in the dictionary. Those that are present are valid correction candidates. Although single mistakes are common, two or more mistyped letters in a single word is fairly rare, so the algorithm does pretty well (as the article shows).

The Soundex algorithm is a “phonetic” algorithm: it finds candidate words that “sound like” the misspelled word.

Of course the article doesn’t go into deeper, harder algorithms for space reasons. For example, Bloom filters are great for compressing the dictionary at the expense of a few false positives and can be used to replace the article’s hash table for the word list. Different dictionaries could be employed that just store root words and then describe possible suffixes, prefixes, and circumfixes that can go with each word (for example, Hunspell). Better phonetic algorithms now exist than Soundex (for example, Metaphone). Nevertheless, I think the article hangs together quite well and I’m pretty pleased with it.

This article first appeared in issue 280, April 2009.

You can download the PDF here.

Album cover for Plunkett & Macleane [Original Score]Now playing:
Armstrong, Craig - Revelations
(from Plunkett & Macleane [Original Score])


PCPlus 270: An introduction to diffs

I write a monthly column for PCPlus, a computer news-views-n-reviews magazine in the UK (actually there are 13 issues a year — there's an Xmas issue as well — so it's a bit more than monthly). The column is called Theory Workshop and appears in the back of every issue. When I signed up, my editor and the magazine were gracious enough to allow me to reprint the articles here after say a year or so. After all, the PDFs do appear on each issue's DVD after a couple of months. When I buy the current issue, I'll publish the article from the issue a year ago. Since I've now got July's issue, here's July 2008's article.

PCPlus logo Although the article is short, the usual two pages, the topic interests me greatly and I wish I could do it more justice. It's about calculating the longest common subsequence (LCS) of two strings and thereby calculate their edit distance and from that, the diffs or the differences between the two strings. Although the diffs between two strings is not that interesting (to a programmer, at least, although I can imagine it's highly meaningful in genetics), the same process can be applied to two text files and the results of that are of great importance. Every version control system out there stores revisions to a file as some kind of diff, for compression, if nothing else.

The standard algorithm I described in the article is pretty simple to understand, but astonishing in that it's so counter-intuitive. I can't imagine how it came about; although, in reality, it's just a dynamic programming problem, when all's said and done. The algorithm is pretty efficient — I even incorporated it into a demo diff program I wrote at Microsoft to show off the new color support in the Console class during the Whidbey cycle (Visual Studio 2005) — but with large files/strings the amount of memory needed can get unwieldy and things slow down. (The algorithm is quadratic in both time and space: O(n*m) essentially.)

There's a version of the algorithm that requires linear space instead, from Dan Hirschberg's 1975 paper A linear space algorithm for computing maximal common subsequences. There are other improvements that deal with special cases, for example in the fields of genetics.

This article first appeared in issue 270, July 2008.

You can download the PDF here.

Now playing:
Somerville, Jimmy - Was That All It Was
(from Suddenly Last Summer)

Search

About Me

I'm Julian M Bucknall, the M because it's my middle initial and because I and the other Julian Bucknall (the movie guy) would like to differentiate ourselves.

I'm a programmer by trade, an actor by ambition, and an algorithms guy by osmosis. I write articles for PCPlus in my spare time, not that there's much of that.

Julian M Bucknall Apart from that, an ex-pat Brit, atheist, microbrew enthusiast, Pet Shop Boys fanboy, slide rule and HP calculator collector, amateur photographer, Altoids muncher.

DevExpress

I'm Chief Technology Officer at Developer Express, a software company that writes some great controls and tools for .NET and Delphi. I'm responsible for the technology oversight and vision of the company.

Validation

Validate markup as HTML5 (beta)     Validate CSS

Bottom swirl

Archives

February 2012 (4)
SMTWTFS
« Jan  
1234
567891011
12131415161718
19202122232425
26272829

Like this Archive Calendar widget? Download it here.

Social networking

Google ads

The OUT Campaign

The OUT Campaign

My Tweets

Bottom swirl