PCPlus 264: Regular expressions

I write a monthly column for PCPlus, a computer news-views-n-reviews magazine in the UK (actually there are 13 issues a year — there's an Xmas issue as well — so it's a bit more than monthly). The column is called Theory Workshop and appears in the back of every issue. When I signed up, my editor and the magazine were gracious enough to allow me to reprint the articles here after say a year or so. After all, the PDFs do appear on each issue's DVD after a couple of months.

PCPlus logo This time around I decided to talk about how regular expressions work and how they're implemented. As I mention in the article, regular expressions still fascinate me after many years.

I suppose this all goes back to my fascination with parsers. I like writing parsers, they're hard but fun. I think the ability to write a functional parser (doesn't have to be pretty) is the entryway into becoming a better programmer. After all the main tool we use to create or applications is a compiler: a parser with a specific backend. I'm not saying that the parser you write should be all-encompassing, merely that the act of writing one gives you deeper insight into all kinds of programming ideas and issues. You have to decide whether to be bottom-up or top-down, recursive or not. You have to use trees and other basic data structures. You are learning by doing.

I've now written a couple of regular expression parsers and have written the code to apply them to a string to be matched. Were they the best regex compilers ever written? Heck, no. But I now understand the mechanics of parsing a regex, why certain regexes are very fast and why others would take centuries to complete. I understand about backtracking, and the different ways to implement it. And so on, so forth.

I used this paper as a reference: it reactivated some of the fascination and enthusiasm.

This article first appeared in issue 264, January 2008.

You can download the PDF here.

Album cover for Corps et Armes Now playing:
Daho, Etienne - La Baie
(from Corps et Armes)

Loading similar posts...   Loading links to posts on similar topics...

No Responses

Feel free to add a comment...

Leave a response

Note: some MarkDown is allowed, but HTML is not. Expand to show what's available.

  •  Emphasize with italics: surround word with underscores _emphasis_
  •  Emphasize strongly: surround word with double-asterisks **strong**
  •  Link: surround text with square brackets, url with parentheses [text](url)
  •  Inline code: surround text with backticks `IEnumerable`
  •  Unordered list: start each line with an asterisk, space * an item
  •  Ordered list: start each line with a digit, period, space 1. an item
  •  Insert code block: start each line with four spaces
  •  Insert blockquote: start each line with right-angle-bracket, space > Now is the time...
Preview of response