Posts tagged with 'parser'

Pasting code from VS into WLW, part 2

Last time in this two-parter, I laid out just enough of the RTF format to paste code from VS into WLW, along with some of the helper classes I started off with. This time, we'll look at the parser and the various tricks I used to make sure that the translated HTML was valid and produced the correct look for the code in a web page.

The parser I implemented is pretty much a standard top-down parser. I started off with a method that parsed the RTF document by calling other parsing methods to parse the header and the document, and those parsing methods would in turn call others to parse different chunks of the RTF document. Eventually I'd get to the point where I'd be parsing tokens. Essentially, a top-down parser is like a matryoshka doll, a set of nested dolls. The result of each parsing step would be some HTML markup.

Let's see how that works. First up is the highest level parsing method which sets up the initial state. It is this method that will be called by the WLW plug-in.

    public static string ParseRtf(string rtfValue) {
      ParserState state = new ParserState(rtfValue);
      IParseResult result = ParseRtf(state);
      if (result.Failed)
        return "***FAILED***";
      string html = result.Value;
      if (state.InBackgroundSpan)
        html += "</span>";
      if (state.InFontColorSpan)
        html += "</span>";
      return html;
    }

ParseRtf() gets the RTF document as a string (if you look here, you'll see that's what the Clipboard object returns), creates a new parser state object with it, and then calls the controlling top-level ParseRtf() method. On return, if the parser failed, we just return a simple error message string. (It looks a little tacky perhaps, but I'm not particularly bothered: I'll see it immediately in WLW and will be able to fix it. Without fail, it'll be because I forgot to copy the code properly from VS, and so the "fix" will be to try again.) If the parse succeeded, I need to make sure that any background or foreground color spans are properly terminated. I'll get to the reason for this in a moment when I talk about parsing the document content.

Here's the top-level parse method:

    private static IParseResult ParseRtf(ParserState state) {
      if (state.Current != '{') return FailedParse.Get;
      state.Advance();
      IParseResult result = ParseHeader(state);
      if (result.Failed) return result;
      result = ParseDocument(state);
      if (result.Failed) return result;
      if (state.Current != '}') return FailedParse.Get;
      state.Advance();
      return result;
    }

When this method is called it expects to see the very first brace of an RTF document. If it doesn't, fail immediately. Otherwise, jump over the brace, and then parse the header. If that failed, return immediately. If it succeeded, parse the document content. If that, in turn, failed, return immediately. If it succeeded, I expect to see the final closing brace of the whole RTF document. If not, fail immediately. Otherwise, jump over the final brace and return.

You can see from this the general process for all the parsing methods: call a more specific parse method and, if it fails, return immediately with the failure. If it succeeds, go on to the next more-specific parse method and do the same. Sometimes there's a check for a particular character I expect to be present in the RTF stream, failing if it's not there. Having described the general process, I won't draw attention to it again, unless there's something specific I want to discuss.

Let's now parse the header block for an RTF document, at least the very simplified header VS gives us and bearing in mind that we don't really care about much of it, apart from the color table.

    private static IParseResult ParseHeader(ParserState state) {
      IParseResult result;
      do {
        result = ParseHeaderKeyword(state);
        if (result.Failed) return result;
      } while (state.Current != '{');
      do {
        result = ParseHeaderGroup(state);
        if (result.Failed) return result;
      } while (state.Current != '\\');
      return result;
    }

This method is the first where I really throw away a lot of stuff I'm not interested in. In essence, I make the assumption that there are a bunch of keywords (that is, alphanumeric identifiers preceded by backslashes), followed by a set of header groups (data enclosed in braces). We'll see that one of these header groups is the color table. The header groups are terminated by a backslash, which happens to be the first keyword of the document content. (Please check out the example RTF document here.)

I'll make the point again that I am deliberately simplifying the RTF header to suit my very specific need to convert code in RTF into HTML markup. The header has way more structure than this, but, again, I'm not interested. Don't take this code and apply it to a general word-processing document!

The ParseHeaderKeyword() method is very simple and makes use of a ParseKeyword() method to read the identifier:

    private static IParseResult ParseKeyword(ParserState state) {
      StringBuilder sb = new StringBuilder();
      do {
        sb.Append(state.Current);
        state.Advance();
      } while (char.IsLetterOrDigit(state.Current));
      if (state.Current == ' ')
        state.Advance();
      return new SuccessfulParse(sb.ToString());
    }

    private static IParseResult ParseHeaderKeyword(ParserState state) {
      if (state.Current != '\\') return FailedParse.Get;
      state.Advance();

      return ParseKeyword(state);
    }

Essentially: a keyword starts with a backslash, has a bunch of alphanumeric characters, and might have a terminating space that we should skip. I do return the keyword (without the backslash) as part of a successful result.

The ParseHeaderGroup() method is next:

    private static IParseResult ParseHeaderGroup(ParserState state) {
      if (state.Current != '{') return FailedParse.Get;
      state.Advance();
      IParseResult result = ParseHeaderKeyword(state);
      if (result.Failed) return result;
      if (result.Value == "colortbl") {
        result = ParseColors(state);
      }
      else {
        result = ParseHeaderGroupData(state);
      }
      if (state.Current != '}') return FailedParse.Get;
      state.Advance();
      return result;
    }

The group starts with an opening brace and has a keyword that describes the group. If that keyword is colortbl, I need to parse out the colors and create a list to help with parsing the document content. Otherwise I just need to parse the remainder of the group. At the end I expect to see the closing brace, which I can skip.

Let's get parsing the group data out of the way (in essence, I ignore it all):

    private static IParseResult ParseHeaderGroupData(ParserState state) {
      StringBuilder sb = new StringBuilder();
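      while (state.Current != '}') {
        // Ran off the end of the input without seeing the closing brace.
        if (state.Current == '\0') return FailedParse.Get;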
        if (state.Current == '{') {
          IParseResult result = ParseHeaderGroup(state);
          if (result.Failed)
            return result;
          sb.Append(result.Value);
        }
        else {
          sb.Append(state.Current);
          state.Advance();
        }
      }
      return new SuccessfulParse(sb.ToString());
    }

The only interesting thing about this is that a header group may have another header group embedded in it. (Take a look at the font table in the RTF document.) So I have to make sure I track the opening and closing braces.

The color table in the RTF document consists of a set of colors, each terminated with a semicolon.

    private static IParseResult ParseColors(ParserState state) {
      do {
        StringBuilder sb = new StringBuilder();
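        while (state.Current != ';') {
          // A color must end with a semicolon before the group or the input ends.
          if (state.Current == '\0' || state.Current == '}') return FailedParse.Get;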
          sb.Append(state.Current);
          state.Advance();
        }
        state.ColorTable.Add(sb.ToString());
        state.Advance();
      } while (state.Current != '}');
      return new SuccessfulParse("");
    }

The method parses each color out as a string in the form \red43\green145\blue175 and calls the ColorTable object's Add() method to add each of them. The delimiting semicolons are jumped over. The method finishes when it sees the final closing brace of the group.

We've now seen all the header parsing methods that we need for our particular application. In essence, we ignore everything except for the embedded color table, which we extract into a list of colors for our own purposes. Now the fun stuff: the document content, the actual colorized code from VS.

The essential process here is to read the content character by character, building an HTML encoded string as we go. We stop when we get to the closing brace of the entire RTF document. If the current character is not a backslash, we convert it to an HTML entity if needed (so & becomes &amp; for example), and append it to the HTML string. If the current character is a backslash, we may be seeing an escaped character or it may be a start of a keyword. If the former, we encode it if needed and add it to the HTML. If the latter, we need to process that keyword, whatever it may be.

    private static IParseResult ParseDocument(ParserState state) {
      StringBuilder sb = new StringBuilder();
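      while (state.Current != '}') {
        // Ran off the end of the input without seeing the final brace.
        if (state.Current == '\0') return FailedParse.Get;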
        if (state.Current == '\\') {
          IParseResult result = ParseDocEscapedChar(state);
          if (result.Succeeded) {
            sb.Append(ConvertEntity(state.Current));
            state.Advance();
          }
          else {
            result = ParseDocKeyword(state);
            if (result.Failed) return result;
            string s = ProcessDocKeyword(result.Value, state);
            sb.Append(s);
          }
        }
        else {
          sb.Append(ConvertEntity(state.Current));
          state.Advance();
        }
      }
      return new SuccessfulParse(sb.ToString());
    }

This method relies on a slew of simple helper methods that need little to no explanation. First, parsing an escaped character fails if the character after the backslash is alphanumeric (since it would then be a keyword, not an escaped character), and succeeds otherwise. Notice that the method jumps over the backslash either way.

    private static IParseResult ParseDocEscapedChar(ParserState state) {
      state.Advance();
      if (char.IsLetterOrDigit(state.Current))
        return FailedParse.Get;

      return new SuccessfulParse(state.Current.ToString());
    }

The conversion to an HTML entity is pretty simple and might need to be extended if you use other characters in your code that need converting to an entity.

    public static string ConvertEntity(char current) {
      switch (current) {
        case '&': return "&amp;";
        case '<': return "&lt;";
        case '>': return "&gt;";
        default:
          return current.ToString();
      }
    }
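
If you'd rather not maintain the list of entities by hand, one possible alternative is to lean on the framework. This is only a sketch, assuming .NET 4 or later where System.Net.WebUtility is available; the EntitySketch class name is made up for illustration and is not part of the plug-in.

```csharp
using System;
using System.Net;

// Hypothetical stand-in for the hand-written ConvertEntity() above.
public static class EntitySketch {
  public static string ConvertEntity(char current) {
    // HtmlEncode handles &, < and > as before, and also encodes
    // quote characters, which the hand-written switch leaves alone.
    return WebUtility.HtmlEncode(current.ToString());
  }
}
```

The trade-off is that quotes in the code come out as entities too, which is harmless in a web page but makes the generated markup a little noisier.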

And now, finally we get to the really interesting method, the one that processes a keyword in the content.

Some background first. There are three keywords that we're going to process; all the others are ignored. The three are:

  • \cfN: set the font color to color N, where N is an index into the color table. \cf0 means "set the font to the default font color".
  • \cbN: set the background color to color N, where N is again an index into the color table. \cb0 means "set the background to the default background color".
  • \par: output a new line.

Seems simple enough, but there is a problem with the color keywords.

In essence what I'm going to do is to issue <span> tags around text that is of a different color. So to take an example from our RTF document, I'm going to convert the RTF

\cf1 namespace\cf0

into the HTML

<span style="color: #0000ff;">namespace</span>

In other words, replace the "change color to N" keyword with an opening span tag, styled to the correct color, and the "revert to the default color" keyword with the relevant closing tag.

If all the RTF was like this, there would be no problem. However, check out the following code fragment:

return String.Format(

Its RTF version is this:

\cf1 return\cf0  \cf4 String\cf5 .\cf0 Format(

That first color change is the type I've already identified as simple. The second one is worse. In RTF it says: change the font color to 4, output "String", change the font color to 5, output ".", revert to the default font color. In other words, RTF doesn't surround text with begin color, end color pairs — which we'd like to have since that's how HTML works — but acts more like a stream: start this color, output some text, start this other color, output text, start another color, output text, etc, etc. There's essentially no "stop using this color" keyword, although we can use \cf0 for that.

So in the conversion code I have to track whether we're in a "font color span" and in a "background color span". If we are and we receive a color change keyword that is not the "revert to the default" keyword, we have to output a closing span tag before we open up another span tag.

Here's another code fragment:

return String.

which has the following RTF:

{...header stuff...
\fs24 \cf1 return\cf0  \cf4 String\cf5 .}

Notice that there is no "revert to default color" keyword at the end of the content. We're left dangling in "font color 5" mode. No can do in HTML, so that's why in the very top of the parsing tree, I had those extra checks to output </span> tags if they were needed.

Having provided the background, it should be easy to understand the rather complicated, somewhat unrefactored ProcessDocKeyword() method:

    private static string ProcessDocKeyword(string keyword, ParserState state) {
      int colorIndex;
      Color color;
      string format;

      Regex regex = new Regex(@"([a-z]+)(\d*)");
      Match match = regex.Match(keyword);
      string keywordBase = match.Groups[1].Value;
      switch (keywordBase) {
        case "cf":
          colorIndex = int.Parse(match.Groups[2].Value);
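          if (colorIndex == 0) {
            // Only close a font color span that was actually opened.
            if (!state.InFontColorSpan) return String.Empty;
            state.InFontColorSpan = false;
            return "</span>";
          }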
          color = state.ColorTable[colorIndex];
          format = "<span style=\"color: #{0:x2}{1:x2}{2:x2};\">";
          if (state.InFontColorSpan)
            format = "</span>" + format;
          state.InFontColorSpan = true;
          return String.Format(format, color.R, color.G, color.B);
        case "cb":
          colorIndex = int.Parse(match.Groups[2].Value);
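          if (colorIndex == 0) {
            // Only close a background span that was actually opened.
            if (!state.InBackgroundSpan) return String.Empty;
            state.InBackgroundSpan = false;
            return "</span>";
          }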
          color = state.ColorTable[colorIndex];
          format = "<span style=\"background-color: #{0:x2}{1:x2}{2:x2};\">";
          if (state.InBackgroundSpan)
            format = "</span>" + format;
          state.InBackgroundSpan = true;
          return String.Format(format, color.R, color.G, color.B);
        case "par":
          return Environment.NewLine;
      }
      return String.Empty;
    }

And that's it: converting an RTF document from a copy operation of some selected code in Visual Studio to HTML that we can then paste into Windows Live Writer through a plug-in.

There are bound to be some bugs in this, but it works for how I use the syntax highlighting colors in the VS editor. For example, if you're a fan of the darker, more restful color themes that have colored backgrounds, such as those here, it fails:

namespace RtfToHtml {
  // The result of a parse operation
  public interface IParseResult {
    bool Succeeded { get; }
    bool Failed { get; }
    string Value { get; }
  }
}

I can see from this that the background span is being closed off and not the font color span. I'm guessing that my simple "in font color span" and "in background span" bools won't cut it and you'll possibly have to output a single span that includes both the foreground and background colors. Mind you, in this case, I think the whole enclosing div should be output with the background color rather than the actual text, as I have here. Another feature, another day.
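
For what it's worth, here's a rough sketch of how that "one span carrying both colors" idea might look. It is untested against real VS output, and the SpanTracker class and its method names are hypothetical, not part of the plug-in. The assumption is that colors arrive as ready-made HTML strings such as "#0000ff", with null standing for "revert to the default".

```csharp
using System;

// Hypothetical helper: tracks both colors and keeps at most one span open.
public class SpanTracker {
  private string fontColor;        // null means the default font color
  private string backgroundColor;  // null means the default background color
  private bool spanOpen;

  // Pass null to revert to the default font color.
  public string SetFontColor(string htmlColor) {
    fontColor = htmlColor;
    return Reopen();
  }

  // Pass null to revert to the default background color.
  public string SetBackgroundColor(string htmlColor) {
    backgroundColor = htmlColor;
    return Reopen();
  }

  // Call once at the end of the document to close a dangling span.
  public string Finish() {
    string html = spanOpen ? "</span>" : "";
    spanOpen = false;
    return html;
  }

  // Closes the current span (if any) and opens a new one carrying
  // whichever colors are now in force.
  private string Reopen() {
    string html = spanOpen ? "</span>" : "";
    spanOpen = false;
    string style = "";
    if (fontColor != null)
      style += "color: " + fontColor + ";";
    if (backgroundColor != null) {
      if (style.Length > 0) style += " ";
      style += "background-color: " + backgroundColor + ";";
    }
    if (style.Length > 0) {
      html += "<span style=\"" + style + "\">";
      spanOpen = true;
    }
    return html;
  }
}
```

Because there is only ever one open span, a closing tag can never be matched against the wrong opening tag. The cost is some redundant markup: every color change closes and reopens the span, even when only one of the two colors changed.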

Pasting code from Visual Studio into Windows Live Writer

Way back in January this year, I briefly explained how I was pasting code into my blog posts so that they were displayed fully syntax-highlighted. At the time I said I'd explain how the underlying parser works, but never got round to it. Well, it's the Friday after Thanksgiving (so-called Black Friday), and I'm feeling voluble. Besides, my colleague Mehul Harry just asked me on Twitter how I did it.

The basic process goes like this: Select a bunch of code from Visual Studio's editor and copy it to the clipboard. VS actually copies it there in two formats: plain text, which we most assuredly don't want, and in RTF (Rich Text Format), which we do since it contains all that juicy colorful syntax highlighting. All that remains is to read the RTF data, parse it, and output it as HTML, which we can then easily paste into WLW (Windows Live Writer) using the standard plug-in architecture (which I talked about last time).

Here's an example of what the RTF looks like. This code:

namespace RtfToHtml {
  // The result of a parse operation
  public interface IParseResult {
    bool Succeeded { get; }
    bool Failed { get; }
    string Value { get; }
  }
}

gets copied to the clipboard as RTF like this:

{\rtf1\ansi\ansicpg\lang1024\noproof1252\uc1 \deff0{\fonttbl{\f0\fnil\fcharset0\fprq1 Consolas;}}{\colortbl;
\red0\green0\blue255;\red255\green255\blue255;\red0\green0\blue0;\red255\green255\blue0;\red43\green145\blue175;}
\fs24 \cf1 namespace\cf0  RtfToHtml \{\par 
  \cb4\highlight4 // The result of a parse operation\par 
\cb0\highlight0   \cf1 public\cf0  \cf1 interface\cf0  \cf5 IParseResult\cf0  \{\par 
    \cf1 bool\cf0  Succeeded \{ \cf1 get\cf0 ; \}\par 
    \cf1 bool\cf0  Failed \{ \cf1 get\cf0 ; \}\par 
    \cf1 string\cf0  Value \{ \cf1 get\cf0 ; \}\par 
  \}\par 
\}}

Now, this blog is not the place to explain the full RTF spec. For that you can download the current version from MSDN, and have at it. For our purposes — and I emphasize that what I say here is just enough for us to get by, and if you try and write a general purpose text document RTF parser for a word-processor from this description, you're well and truly nuts and deserve everything you get — it's sufficient to note the basic structure:

{
header stuff
document content
}

Keywords in RTF are formed from letters and digits, prefixed with a backslash, and end with the first non-alphanumeric character. If that character is a space, it's assumed to be part of the keyword and has to be skipped over. For example, in the above data, the document content starts with the keyword \fs24. The space immediately after that is not part of the content, it's just there to terminate the keyword (which, in essence, states that the font size is 24 half-points, which you'll see we shall ignore). The next keyword, \cf1, is also terminated with a space, which again is not part of the content. Thus, "namespace" will start in column 1, as it should. (Incidentally \cf1 means "use color 1 as the font or foreground color".)
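
To make the keyword rule concrete, here's a small stand-alone sketch. It works directly on a string and an index rather than on the parser state class described later, and the KeywordSketch and ReadKeyword names are invented purely for illustration.

```csharp
using System;
using System.Text;

// Hypothetical illustration of the keyword rule; not the plug-in's parser.
public static class KeywordSketch {
  // Reads one RTF keyword starting at the backslash at 'pos'.
  // On return, 'pos' is the index of the first character after the keyword.
  public static string ReadKeyword(string rtf, ref int pos) {
    if (pos >= rtf.Length || rtf[pos] != '\\')
      throw new ArgumentException("Expected a backslash at the given position");
    pos++; // jump over the backslash
    StringBuilder sb = new StringBuilder();
    while (pos < rtf.Length && char.IsLetterOrDigit(rtf[pos])) {
      sb.Append(rtf[pos]);
      pos++;
    }
    // A single space terminating the keyword belongs to the keyword,
    // not to the content, so skip it.
    if (pos < rtf.Length && rtf[pos] == ' ')
      pos++;
    return sb.ToString();
  }
}
```

Run against the start of the content above, the first call reads fs24 and the second reads cf1, leaving the position on the "n" of "namespace". After \cf0 only one of the two following spaces is swallowed; the other is real content.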

The header stuff has some keywords that introduce some special blocks. These special blocks are surrounded by braces, and we only care about one of them: the colortbl.

Ends of lines in the document content are delimited by \par keywords, not by CR/LF (in fact every CR/LF is totally ignored in RTF). Characters that are important to RTF (like braces, backslashes) are themselves preceded with a backslash. So you can see several "\}" in the data above, for example; they just mean a literal closing brace.

OK, that's all I want to describe about RTF for now. Let's get to the code.

The parser structure I shall use involves several supporting classes. The result of a parse operation (a call to a parsing method) will be of type IParseResult (which you've already seen above). It has three properties: Succeeded, Failed, and Value. I could get away with two obviously, but I just prefer reading Failed in code instead of !Succeeded. The Value property is merely the resulting value of a successful parse.

To help out I created two implementations of IParseResult. The first is for a successful parsing operation:

  public class SuccessfulParse : IParseResult {
    public SuccessfulParse(string value) {
      this.Value = value;
    }
    public bool Succeeded {
      get { return true; }
    }
    public bool Failed {
      get { return !Succeeded; }
    }
    public string Value { get; private set; }
  }

And the second is for a failed parse operation:

  public class FailedParse : IParseResult {
    private static IParseResult instance = new FailedParse();
    private FailedParse() { }
    public bool Succeeded {
      get { return false; }
    }
    public bool Failed {
      get { return !Succeeded; }
    }
    public string Value {
      get { return null; }
    }
    public static IParseResult Get {
      get { return instance; }
    }
  }

Note for the latter, I just create a singleton object: all my failed parses return the same object, the one returned by FailedParse.Get. That's OK: I don't store any extra information about the failure since such a failure only really happens when I forget to copy source code onto the clipboard from VS.

Next up is a class to store the parser's state. The state I need to track is: where I am in the RTF document (that is, the current position in the string I get from the clipboard), the current character at that position, the color table for the document, and a couple of helpful booleans to determine if I'm in a block with a different foreground or background color. The state class also comes with an Advance() method to move the current position on by one character, skipping over CR/LF characters.

  public class ParserState {
    private string rtfValue;
    private int position;

    public ParserState(string rtfValue) {
      this.rtfValue = rtfValue;
      position = -1;
      Advance();
      ColorTable = new ColorTable();
    }

    public void Advance() {
      do {
        position++;
        if (position >= rtfValue.Length) {
          Current = '\0';
          return;
        }
      } while (rtfValue[position] == '\r' || rtfValue[position] == '\n');
      Current = rtfValue[position];
    }

    public char Current { get; private set; }
    public ColorTable ColorTable { get; private set; }
    public bool InFontColorSpan { get; set; }
    public bool InBackgroundSpan { get; set; }
  }

Finally, we come to the ColorTable class itself. This is designed to store the colors we get from the \colortbl block in the header from the RTF document. Here's the RTF color table from the above RTF document, spaced out to make it more legible and to give emphasis to the various tokens:

{
\colortbl
;
\red0\green0\blue255;
\red255\green255\blue255;
\red0\green0\blue0;
\red255\green255\blue0;
\red43\green145\blue175;
}

There are six colors defined here, delimited by semicolons, and RTF numbers them from 0 to 5. The first color (denoted by the semicolon on its own) is the default color, which we'll assume to be black for a font color (for a background color, the default is assumed to be White, and you'll find us special-casing it later).

The class makes use of two methods to add the colors to an internal list: Add(string color) and Add(Color color). The first uses a regex to separate out the RGB values from the passed-in string, and calls the second to add the actual color value to the list. An empty string is assumed to be Black.

  public class ColorTable {
    List<Color> list = new List<Color>();

    public void Add(string color) {
      if (string.IsNullOrEmpty(color))
        Add(Color.Black);
      else {
        Regex regex = new Regex(@"\\red(\d+)\\green(\d+)\\blue(\d+)");
        Match match = regex.Match(color);
        if (!match.Success) throw new Exception("Invalid color value in color table");
        Add(Color.FromArgb(int.Parse(match.Groups[1].Value),
                           int.Parse(match.Groups[2].Value),
                           int.Parse(match.Groups[3].Value)));
      }
    }

    public void Add(Color color) {
      list.Add(color);
    }

    public Color this[int index] {
      get { return list[index]; }
    }
  }

An alternative would be to generate the HTML color values at this stage instead since we don't really need the Color values, just the HTML hex strings. 
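
Here's roughly what that alternative might look like: a sketch only, with the HexColorSketch class and ToHtmlColor method being names I've made up for the purpose. It reuses the same regex as ColorTable.Add() and assumes each component is in the range 0 to 255.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical alternative to storing Color values in the table.
public static class HexColorSketch {
  private static readonly Regex ColorRegex =
    new Regex(@"\\red(\d+)\\green(\d+)\\blue(\d+)");

  // Converts one color-table entry into an HTML hex color string.
  // An empty entry is the default color, assumed here to be black.
  public static string ToHtmlColor(string rtfColor) {
    if (string.IsNullOrEmpty(rtfColor))
      return "#000000";
    Match match = ColorRegex.Match(rtfColor);
    if (!match.Success)
      throw new FormatException("Invalid color value in color table");
    return String.Format("#{0:x2}{1:x2}{2:x2}",
                         int.Parse(match.Groups[1].Value),
                         int.Parse(match.Groups[2].Value),
                         int.Parse(match.Groups[3].Value));
  }
}
```

With this, the last entry in the table above, \red43\green145\blue175, comes out as #2b91af, and the table could simply be a list of strings.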

Next time, we'll delve into the parser code itself.


About Me

I'm Julian M Bucknall, the M because it's my middle initial and because I and the other Julian Bucknall (the movie guy) would like to differentiate ourselves.

I'm a programmer by trade, an actor by ambition, and an algorithms guy by osmosis. I write articles for PCPlus in my spare time, not that there's much of that.

Apart from that, an ex-pat Brit, atheist, microbrew enthusiast, Pet Shop Boys fanboy, slide rule and HP calculator collector, amateur photographer, Altoids muncher.

DevExpress

I'm Chief Technology Officer at Developer Express, a software company that writes some great controls and tools for .NET and Delphi. I'm responsible for the technology oversight and vision of the company.
