Pasting code from Visual Studio into Windows Live Writer : Algorithms for the masses

Way back in January this year, I briefly explained how I was pasting code into my blog posts so that it was displayed fully syntax-highlighted. At the time I said I'd explain how the underlying parser works, but never got round to it. Well, it's the Friday after Thanksgiving (so-called Black Friday), and I'm feeling voluble. Besides, my colleague Mehul Harry just asked me on Twitter how I did it.

The basic process goes like this: select a bunch of code in Visual Studio's editor and copy it to the clipboard. VS actually copies it there in two formats: plain text, which we most assuredly don't want, and RTF (Rich Text Format), which we do, since it contains all that juicy colorful syntax highlighting. All that remains is to read the RTF data, parse it, and output it as HTML, which we can then easily paste into WLW (Windows Live Writer) using the standard plug-in architecture (which I talked about last time).

Here's an example of what the RTF looks like. This code:

namespace RtfToHtml {
  // The result of a parse operation
  public interface IParseResult {
    bool Succeeded { get; }
    bool Failed { get; }
    string Value { get; }
  }
}

gets copied to the clipboard as RTF like this:

{\rtf1\ansi\ansicpg\lang1024\noproof1252\uc1 \deff0{\fonttbl{\f0\fnil\fcharset0\fprq1 Consolas;}}{\colortbl;
\red0\green0\blue255;\red255\green255\blue255;\red0\green0\blue0;\red255\green255\blue0;\red43\green145\blue175;}
\fs24 \cf1 namespace\cf0  RtfToHtml \{\par 
  \cb4\highlight4 // The result of a parse operation\par 
\cb0\highlight0   \cf1 public\cf0  \cf1 interface\cf0  \cf5 IParseResult\cf0  \{\par 
    \cf1 bool\cf0  Succeeded \{ \cf1 get\cf0 ; \}\par 
    \cf1 bool\cf0  Failed \{ \cf1 get\cf0 ; \}\par 
    \cf1 string\cf0  Value \{ \cf1 get\cf0 ; \}\par 
  \}\par 
\}}

Now, this blog is not the place to explain the full RTF spec. For that you can download the current version from MSDN, and have at it. For our purposes it's sufficient to note the basic structure. (I emphasize that what I say here is just enough for us to get by: if you try to write a general-purpose RTF parser for a word processor from this description, you're well and truly nuts and deserve everything you get.)

{
header stuff
document content
}

Keywords in RTF are formed from letters and digits, prefixed with a backslash, and end with the first non-alphanumeric character. If that character is a space, it's treated as part of the keyword and has to be skipped over. (Strictly speaking, the spec defines a keyword as a run of letters followed by an optional numeric parameter, but "letters and digits" is close enough for our purposes.) For example, in the above data, the document content starts with the keyword \fs24. The space immediately after it is not part of the content; it's just there to terminate the keyword (which, in essence, states that the font size is 24 half-points, and which we shall ignore). The next keyword, \cf1, is also terminated with a space, which again is not part of the content. Thus, "namespace" will start in column 1, as it should. (Incidentally, \cf1 means "use color 1 as the font or foreground color".)
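To make the keyword rule concrete, here's a minimal sketch of reading one keyword from a string. It is not the parser from this series: KeywordReader and ReadKeyword are names made up purely for illustration, and the sketch assumes the position is already sitting on a backslash.

```csharp
using System;
using System.Text;

public static class KeywordReader {
  // Hypothetical helper: reads the keyword that starts at rtf[pos], which
  // must be a backslash. On return, pos is the index of the first character
  // after the keyword (and after its terminating space, if it had one).
  public static string ReadKeyword(string rtf, ref int pos) {
    if (pos >= rtf.Length || rtf[pos] != '\\')
      throw new ArgumentException("Expected a backslash at the given position");
    pos++; // skip the backslash itself

    StringBuilder keyword = new StringBuilder();
    // char.IsLetterOrDigit also accepts non-ASCII letters; good enough for a sketch
    while (pos < rtf.Length && char.IsLetterOrDigit(rtf[pos])) {
      keyword.Append(rtf[pos]);
      pos++;
    }

    // A single space terminates the keyword and is not part of the content
    if (pos < rtf.Length && rtf[pos] == ' ')
      pos++;

    return keyword.ToString();
  }
}
```

Called twice on the text "\fs24 \cf1 namespace", starting at position 0, this returns "fs24" and then "cf1", and leaves the position on the "n" of "namespace". Escaped characters such as \{ come back as an empty keyword here, so a real parser would have to check for them first.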

The header stuff has some keywords that introduce some special blocks. These special blocks are surrounded by braces, and we only care about one of them: the colortbl.

Ends of lines in the document content are delimited by \par keywords, not by CR/LF (in fact every CR/LF is totally ignored in RTF). Characters that are significant to RTF (braces and backslashes) are themselves preceded by a backslash. So you can see several "\}" in the data above, for example; they just mean a literal closing brace.
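As a rough illustration of those rules (and only those rules), the following sketch turns a fragment of document content into plain text. RtfTextSketch and ToPlainText are invented names, not part of the parser described in this series, and the sketch simply throws away every keyword other than \par.

```csharp
using System;
using System.Text;

public static class RtfTextSketch {
  // Hypothetical helper: converts a fragment of RTF document content to
  // plain text. \par becomes a line break, \{ \} and \\ become literal
  // characters, raw CR/LF are dropped, and all other keywords are discarded.
  public static string ToPlainText(string content) {
    StringBuilder result = new StringBuilder();
    int pos = 0;
    while (pos < content.Length) {
      char c = content[pos];
      if (c == '\r' || c == '\n') {
        pos++;                              // CR/LF carry no meaning in RTF
      }
      else if (c != '\\') {
        result.Append(c);                   // ordinary content character
        pos++;
      }
      else if (pos + 1 < content.Length && "{}\\".IndexOf(content[pos + 1]) >= 0) {
        result.Append(content[pos + 1]);    // escaped brace or backslash
        pos += 2;
      }
      else {
        int start = ++pos;                  // first character after the backslash
        while (pos < content.Length && char.IsLetterOrDigit(content[pos]))
          pos++;
        string keyword = content.Substring(start, pos - start);
        if (pos < content.Length && content[pos] == ' ')
          pos++;                            // terminating space goes with the keyword
        if (keyword == "par")
          result.Append('\n');              // \par marks the end of a line
      }
    }
    return result.ToString();
  }
}
```

Feeding it one of the property lines from the RTF above gives back the original source line without the color keywords, with the line break coming from \par rather than from the CR/LF in the data.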

OK, that's all I want to describe about RTF for now. Let's get to the code.

The parser structure I shall use involves several supporting classes. The result of a parse operation (a call to a parsing method) will be of type IParseResult (which you've already seen above). It has three properties: Succeeded, Failed, and Value. Obviously I could get away with just two, but I prefer reading Failed in code instead of !Succeeded. The Value property is merely the resulting value of a successful parse.

To help out I created two implementations of IParseResult. The first is for a successful parsing operation:

  public class SuccessfulParse : IParseResult {
    public SuccessfulParse(string value) {
      this.Value = value;
    }
    public bool Succeeded {
      get { return true; }
    }
    public bool Failed {
      get { return !Succeeded; }
    }
    public string Value { get; private set; }
  }

And the second is for a failed parse operation:

  public class FailedParse : IParseResult {
    private static IParseResult instance = new FailedParse();
    private FailedParse() { }
    public bool Succeeded {
      get { return false; }
    }
    public bool Failed {
      get { return !Succeeded; }
    }
    public string Value {
      get { return null; }
    }
    public static IParseResult Get {
      get { return instance; }
    }
  }

Note that for the latter I just create a singleton object: all my failed parses return the same object, the one returned by FailedParse.Get. That's OK: I don't store any extra information about the failure, since such a failure only really happens when I forget to copy source code onto the clipboard from VS.

Next up is a class to store the parser's state. The state I need to track is: where I am in the RTF document (that is, the current position in the string I get from the clipboard), the current character at that position, the color table for the document, and a couple of helpful booleans to determine if I'm in a block with a different foreground or background color. The state class also comes with an Advance() method to move the current position on by one character, skipping over CR/LF characters.

  public class ParserState {
    private string rtfValue;
    private int position;

    public ParserState(string rtfValue) {
      this.rtfValue = rtfValue;
      position = -1;
      Advance();
      ColorTable = new ColorTable();
    }

    // Moves on by one character, skipping CR/LF; Current becomes '\0' at the end of the text
    public void Advance() {
      do {
        position++;
        if (position >= rtfValue.Length) {
          Current = '\0';
          return;
        }
      } while (rtfValue[position] == '\r' || rtfValue[position] == '\n');
      Current = rtfValue[position];
    }

    public char Current { get; private set; }
    public ColorTable ColorTable { get; private set; }
    public bool InFontColorSpan { get; set; }
    public bool InBackgroundSpan { get; set; }
  }

Finally, we come to the ColorTable class itself. This is designed to store the colors we get from the \colortbl block in the header of the RTF document. Here's the color table from the above RTF document, spaced out to make it more legible and to give emphasis to the various tokens:

{
\colortbl
;
\red0\green0\blue255;
\red255\green255\blue255;
\red0\green0\blue0;
\red255\green255\blue0;
\red43\green145\blue175;
}

There are six colors defined here, delimited by semicolons, and RTF numbers them from 0 to 5. The first color (denoted by the semicolon on its own) is the default color, which we'll assume to be black for a font color (for a background color, the default is assumed to be white, and you'll find us special-casing it later).
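To show how the semicolons carve the block up into numbered entries, here's a small sketch. ColorTableSplitter and SplitEntries are hypothetical names, and the sketch assumes it is handed just the text between the \colortbl keyword and the closing brace.

```csharp
using System;
using System.Collections.Generic;

public static class ColorTableSplitter {
  // Hypothetical helper: splits the body of a \colortbl block (the text
  // between "\colortbl" and the closing brace) into one string per color.
  // Every entry ends with a semicolon, so entry 0 of the table above comes
  // out as the empty string, standing for the default color.
  public static List<string> SplitEntries(string body) {
    List<string> entries = new List<string>();
    int start = 0;
    for (int i = 0; i < body.Length; i++) {
      if (body[i] == ';') {
        // Trim() drops any CR/LF or spaces around the entry
        entries.Add(body.Substring(start, i - start).Trim());
        start = i + 1;
      }
    }
    return entries;
  }
}
```

For the color table shown above this yields six strings: an empty one at index 0, then \red0\green0\blue255 at index 1, and so on up to \red43\green145\blue175 at index 5, which matches the numbers used by keywords such as \cf1 and \cf5.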

The class makes use of two methods to add the colors to an internal list: Add(string color) and Add(Color color). The first uses a regex to separate out the RGB values from the passed-in string, and calls the second to add the actual color value to the list. An empty string is assumed to be Black.

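  // Requires: using System; using System.Collections.Generic;
  //           using System.Drawing; using System.Text.RegularExpressions;
  public class ColorTable {
    // Colors in the order they appear in the \colortbl block
    List<Color> list = new List<Color>();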

    public void Add(string color) {
      if (string.IsNullOrEmpty(color))
        Add(Color.Black);
      else {
        Regex regex = new Regex(@"\\red(\d+)\\green(\d+)\\blue(\d+)");
        Match match = regex.Match(color);
        if (!match.Success) throw new Exception("Invalid color value in color table");
        Add(Color.FromArgb(int.Parse(match.Groups[1].Value),
                           int.Parse(match.Groups[2].Value),
                           int.Parse(match.Groups[3].Value)));
      }
    }

    public void Add(Color color) {
      list.Add(color);
    }

    public Color this[int index] {
      get { return list[index]; }
    }
  }

An alternative would be to generate the HTML color values at this stage instead, since we don't really need the Color values, just the HTML hex strings.
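A minimal sketch of that alternative might look like the following. HtmlColorSketch and ToHtmlColor are names invented for illustration; the regex is the same one used in ColorTable, and as there an empty entry is treated as black.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlColorSketch {
  private static readonly Regex colorRegex =
    new Regex(@"\\red(\d+)\\green(\d+)\\blue(\d+)");

  // Hypothetical helper: converts one color table entry, such as
  // "\red43\green145\blue175", straight to an HTML hex color string.
  // An empty entry is assumed to be black.
  public static string ToHtmlColor(string entry) {
    if (string.IsNullOrEmpty(entry))
      return "#000000";
    Match match = colorRegex.Match(entry);
    if (!match.Success)
      throw new FormatException("Invalid color value in color table");
    int red = int.Parse(match.Groups[1].Value);
    int green = int.Parse(match.Groups[2].Value);
    int blue = int.Parse(match.Groups[3].Value);
    // X2 formats each component as two uppercase hex digits
    return string.Format("#{0:X2}{1:X2}{2:X2}", red, green, blue);
  }
}
```

With this, the entry \red43\green145\blue175 becomes "#2B91AF" and \red0\green0\blue255 becomes "#0000FF". The sketch doesn't check that each component is in the range 0 to 255, which a sturdier version would want to do.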

Next time, we'll delve into the parser code itself.

Now playing:
Yello - Oh Yeah (Big Room Vocal Mix)
(from Oh Yeah 'Oh Six The Remixes)

Fri 27-Nov-2009 7:52 PM Blog / tags: plugin live-writer rtf parser

by Julian M Bucknall