Posts tagged with 'hash-table'

PCPlus 293: Building an efficient dictionary

I think this is pretty much the last article I’ve written for PCPlus that discusses algorithms in a fairly formal sense. As I said last time, my editors and I have slowly been moving my articles towards more “how it works” topics than the traditional “layman’s guide to algorithms” subjects I’m perhaps better known for.

PC Plus logoThis one was essentially a discussion of big-Oh notation and how we can use it to categorize algorithms in terms of their overall performance and memory usage. To illustrate the notation, I used the abstract data structure known as an associative array, or dictionary, implementing it in several different ways, and showed each efficiency in terms of big-Oh. I not only discussed the average big-Oh values, but also the best-case and worst-case scenarios.

The implementations I used were unsorted arrays, sorted arrays (and binary search), hash tables, balanced binary search trees, and I briefly touched on radix trees and ternary trees.

Not too bad an introduction to big-Oh, if I say so myself.

This article first appeared in issue 293, April 2010.

You can read the PDF here.

(I write a monthly column for PCPlus, a computer news-views-n-reviews magazine in the UK (actually there are 13 issues a year — there's an Xmas issue as well — so it's a bit more than monthly). The column is called Theory Workshop and appears in the Make It section of the magazine. When I signed up, my editor and the magazine were gracious enough to allow me to reprint the articles here after say a year or so.)

Album cover for Gran RiservaNow playing:
dZihan & Kamien - Stiff Jazz
(from Gran Riserva)


PCPlus 277: Dictionaries and hash tables

I write a monthly column for PCPlus, a computer news-views-n-reviews magazine in the UK (actually there are 13 issues a year — there’s an Xmas issue as well — so it’s a bit more than monthly). The column is called Theory Workshop and appears in the Make It section of the magazine. When I signed up, my editor and the magazine were gracious enough to allow me to reprint the articles here after say a year or so. What I’ll do is publish the article from a year ago or so here when I purchase the current issue.

PCPlus logoJanuary 2009 was an important change for me. It seems that the Editor was pleased enough with my pieces (and I presume so were the readers) that my commission each month expanded to three pages instead of the prior two (or, if you’re counting, 2000 words instead of 1300). Not quite a 50% increase in pay to go along with the 50% increase in surface area — ha! — but to be quite honest I didn’t particularly care, and still don’t: I really enjoy writing for them and the money they pay is pretty good anyway. The change in word count meant that I could start to do my topics in more depth. Before I would sometimes be struggling to contain the topic in the space, but now the extra room allowed me to cover more detail.

The first article in this new expanded section had to be a good one. I decided to cover one of my favourite data structures: the hash table or dictionary.

Even with the extra space, all I could cover was the basic hash table together with linear probing as a collision resolution mechanism (and example of open addressing) and the problems of clustering. Hash tables with open addressing is still one of my favourite ways of implementing a dictionary, and so writing the article was fairly quick.

I liked doing the figures too, although Figure 2 is a bit bizarre without some explanation (time is meant to be read from top to bottom, first we insert a record whose hash resolves to index 2, then one at 17, then one at 11, etc; the more we add records, the more collisions there are and they tend to cluster). The figure cries out for animation, as I discovered recently when I slipped in a couple of images of hash tables to my CTO video on seeing things in black and white. There I wanted to show how disruptive it can be when a hash table grows, and it too needed animation, especially in a video like that. I’ve since found out that Illustrator can produce Flash animations from a series of layers, etc, but I haven’t had a chance to play around with that as yet.

Double-take: a hash table “disruptive”? Indeed, yes, under certain situations. Much is made of the fact that insertion and search in a hash table is O(1), that is, it’s constant time whether there are 10 items in the table or 10,000. Not strictly true, as it happens, it’s more of an amortized O(1) over many insertions or searches. The reason is that, during an insertion, there may be a point when the load factor is too high (as explained in the article, for open addressing that’s assumed to be about 2/3 full), and the hash table has to be grown. This requires allocating a new array (traditionally it’s set to twice the size of the original), and then rehashing and inserting all of the current items into the new array. This is an O(n) operation and it happens every n items, so, amortized, it smears out to a constant addition factor over all n items. Whoopee for the big-Oh notation, but in practice, if you have a hash table containing a huge number of items, the time taken for this growth may be noticeable by the user or by a real-time process that’s not expecting it.

This article first appeared in issue 277, January 2009.

You can download the PDF here.

(Quick aside: PCPlus used to put part of their archive as PDFs on the DVD in the back of the magazine. They’ve now moved to a CD instead of a DVD, presumably to save on costs, and the archive is no longer on there. I hear they’re going to publish it online instead, sometime in the near future.)

Best case, Worse case, Amortized case

A couple of weeks ago, Hacker News pointed to an article about how bad hash tables could be in a worse case scenario.

In essence, the thrust of the argument was that, in a worse case scenario, hash tables degenerate from a very nice O(1) for insertion and deletion of items to O(N). The reason is that hash tables’ quoted run-time efficiency is an amortized value and not an absolute result. If you like, on average over very many insertions and deletions, hash tables behave in a O(1) manner, but there can be cases where a single insertion can be O(N), generally when the table is grown because its load factor grows above a certain value.

The author of the piece then goes on to say that because of this he always uses a balanced binary tree for a dictionary or associative array, because its run-time efficiency is O(log N) for insertions and deletions, even in the worse case. That is, a balanced binary tree, such as a red-black tree, offers a worse-case guarantee that insertions and deletions will not take longer than O(log N).

This is all quite true, but I would venture to guess that the worse case scenario does not occur for the vast majority of applications. For example, let us take the spell checker scenario where we store a wordlist in a hash table. (Ignore for the moment the fact that a trie or a ternary tree might be a better structure than either a hash table or a balanced binary tree for this.) One would assume that the wordlist size would not change drastically during the run of the application with violent swings in the number of words. In which case, the developer would optimize the dictionary, for example, by tweaking the hash function or by initializing the number of elements in the table correctly. In such an example, there would be no such worse case. (I shall ignore the worse case of the crappy developer, as one does in all discussions of run-time efficiency.)

Another example is associative arrays in dynamic languages like JavaScript or Ruby. There you are pretty much forced to use hash tables because they’re, well, part of the infrastructure. Presumably performance testing will reveal problems in your code (again, ignoring the issue of the crap developer who doesn’t test) that you can then fix, if possible.

My viewpoint here that, unless you are aware of wildly varying numbers of name-value pairs in your application, it’s very likely that a hash table will suffice for your dictionary needs. Instead of faster efficiency the vast majority of time, by using a balanced binary tree you are slower all the time.

After all, every implementation of quicksort will have a sequence of input items that will force an O(n^2) run-time. Using random selection of the pivot can alleviate this to a certain extent (that is, make it extremely unlikely to happen). Yet, how many people worry about that? Indeed, how many people know where quicksort is being used in their run-time library?

Mind you, all this is very academic. The real rule is: measure your own code’s performance. Don’t take big-Oh numbers as gospel: for your application and your data, the real world may be very different.

Search

About Me

I'm Julian M Bucknall, the M because it's my middle initial and because I and the other Julian Bucknall (the movie guy) would like to differentiate ourselves.

I'm a programmer by trade, an actor by ambition, and an algorithms guy by osmosis. I write articles for PCPlus in my spare time, not that there's much of that.

Julian M Bucknall Apart from that, an ex-pat Brit, atheist, microbrew enthusiast, Pet Shop Boys fanboy, slide rule and HP calculator collector, amateur photographer, Altoids muncher.

DevExpress

I'm Chief Technology Officer at Developer Express, a software company that writes some great controls and tools for .NET and Delphi. I'm responsible for the technology oversight and vision of the company.

Validation

Validate markup as HTML5 (beta)     Validate CSS

Bottom swirl

Archives

February 2012 (3)
SMTWTFS
« Jan  
1234
567891011
12131415161718
19202122232425
26272829

Like this Archive Calendar widget? Download it here.

Social networking

Google ads

The OUT Campaign

The OUT Campaign

My Tweets

  • @TerriMorton "The Texan-ized Eiffel Tower" <shudders, whimpers in corner> /cc @rachelreese
  • One of my blog readers found this awesome picture of Roger Moore modeling a pullover in a "Father and Son" pattern http://t.co/DRs4dLSu
  • @RachelHawley First Vaseline, then a drill. It's a good job I have no imagination.
Bottom swirl