PCPlus 267: Use algorithms to break up paragraphs

I write a monthly column for PCPlus, a computer news-views-n-reviews magazine in the UK (actually there are 13 issues a year — there's an Xmas issue as well — so it's a bit more than monthly). The column is called Theory Workshop and appears in the back of every issue. When I signed up, my editor and the magazine were gracious enough to allow me to reprint the articles here after say a year or so. After all, the PDFs do appear on each issue's DVD after a couple of months. When I buy the current issue, I'll publish the article from the issue a year ago. Since I've just bought the issue for April 2009, here's April 2008's article.

PCPlus logo This is an algorithm I particularly like. It's just an ingenious way of working out how to break up paragraphs so that they look less right-ragged than the standard "try and fit as many words on a line until you can't fit any more" algorithm and is a great way of illustrating a dynamic algorithm to boot.

Unfortunately a couple of things didn't come out properly in the typesetting (and I didn't point out the problematic areas when I sent the article in, so it's more my fault than the poor typesetter). The cost matrix shown on the second page is practically impossible to work out without (a) a monospace font and (b) the infinity symbol instead of the ffi ligature. Here's the corrected cost matrix for "She is a diva." with a line length of 6:

 27   0   ∞   ∞
  -  64   8   ∞
  -   - 125   ∞
  -   -   -   0

Also in one of the paragraphs on the second page there's a "2n-1", which makes no sense until you realize that it should be 2n-1.

Looking at the article now, I'm suddenly very struck with the way JFK's speech ended up with a "widow" on the last line. Yuk: I should have changed the algorithm so that splitting the last two words was impossible. Actually that reminds me of something I read recently about how to avoid such widows in HTML and CSS. It seems that, even though CSS is about styling and setting your text, there is no way to prevent word widows like that. The author (and I wish I could remember where I saw this so I could link) recommended either altering your HTML so that the last inter-word space was a non-breaking space (&nbsp;) or writing a bit of JavaScript to do it for you for text within <p> tags after the document was loaded. Possibly writing an add-in for Windows Live Writer to do this as a post is published would be a fun project for a rainy day. Food for thought.

One day, I'll write up the Knuth-Plass algorithm which is of an order of magnitude more difficult to describe and implement.

This article first appeared in issue 267, April 2008.

You can download the PDF here.

Album cover for Eccentrix Remixes Now playing:
Yello - On Track [Doug Laurent's First Journey]
(from Eccentrix Remixes)

Loading similar posts...   Loading links to posts on similar topics...

No Responses

Feel free to add a comment...

Leave a response

Note: some MarkDown is allowed, but HTML is not. Expand to show what's available.

  •  Emphasize with italics: surround word with underscores _emphasis_
  •  Emphasize strongly: surround word with double-asterisks **strong**
  •  Link: surround text with square brackets, url with parentheses [text](url)
  •  Inline code: surround text with backticks `IEnumerable`
  •  Unordered list: start each line with an asterisk, space * an item
  •  Ordered list: start each line with a digit, period, space 1. an item
  •  Insert code block: start each line with four spaces
  •  Insert blockquote: start each line with right-angle-bracket, space > Now is the time...
Preview of response