Pasting code from VS into WLW, part 2
Last time in this two-parter, I laid down the basics of the RTF I followed in pasting code from VS to WLW, and some of the helper classes I started off with. This time, we'll look at the parser and the various tricks I used to make sure that the translated HTML was valid and produced the correct look for the code in a web page.
The parser I implemented is pretty much a standard top-down parser. I started off with a method that parsed the RTF document by calling other parsing methods to parse the header and the document, and those parsing methods would in turn call others to parse different chunks of the RTF document. Eventually I'd get to the point where I'd be parsing tokens. Essentially, a top-down parser is like a matryoshka doll, a set of nested dolls. The result of each parsing step would be some HTML markup.
Let's see how that works. First up is the highest level parsing method which sets up the initial state. It is this method that will be called by the WLW plug-in.
    public static string ParseRtf(string rtfValue)
    {
        ParserState state = new ParserState(rtfValue);
        IParseResult result = ParseRtf(state);
        if (result.Failed)
            return "***FAILED***";

        string html = result.Value;
        if (state.InBackgroundSpan)
            html += "</span>";
        if (state.InFontColorSpan)
            html += "</span>";
        return html;
    }
ParseRtf() gets the RTF document as a string (if you look here, you'll see that's what the Clipboard object returns), creates a new parser state object with it, and then calls the controlling top-level ParseRtf() method. On return, if the parser failed, we just return a simple error message string. (It looks a little tacky perhaps, but I'm not particularly bothered: I'll see it immediately in WLW and will be able to fix it. Without fail, it'll be because I forgot to copy the code properly from VS, and so the "fix" will be to try again.) If the parse succeeded, I need to make sure that any background or foreground color spans are properly terminated. I'll get to the reason for this in a moment when I talk about parsing the document content.
Here's the top-level parse method:
    private static IParseResult ParseRtf(ParserState state)
    {
        if (state.Current != '{')
            return FailedParse.Get;
        state.Advance();

        IParseResult result = ParseHeader(state);
        if (result.Failed)
            return result;

        result = ParseDocument(state);
        if (result.Failed)
            return result;

        if (state.Current != '}')
            return FailedParse.Get;
        state.Advance();
        return result;
    }
When this method is called it expects to see the very first brace of an RTF document. If it doesn't, fail immediately. Otherwise, jump over the brace, and then parse the header. If that failed, return immediately. If it succeeded, parse the document content. If that, in turn, failed, return immediately. If it succeeded, I expect to see the final closing brace of the whole RTF document. If not, fail immediately. Otherwise, jump over the final brace and return.
You can see from this the general process for all the parsing methods: call a more specific parse method and, if it fails, return immediately with that failure. If it succeeds, go on to the next more specific parse method and do the same. Sometimes a method checks for a particular character it expects to see in the RTF stream and fails if it's not there. Having described the general process, I won't draw attention to it again unless there's something specific I want to discuss.
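To make that fail-fast pattern concrete outside the RTF code, here is a minimal, self-contained sketch. The result types are hypothetical stand-ins for the helper classes from part 1 (the real ones may well differ), and MiniState, FailFastDemo and ParseBracedNumber are names invented purely for this illustration. It parses an opening brace, some digits and a closing brace, failing as soon as any step fails.

```csharp
using System;
using System.Text;

// Hypothetical stand-ins for the part 1 helper types; the real ones may differ.
public interface IParseResult
{
    bool Succeeded { get; }
    bool Failed { get; }
    string Value { get; }
}

public sealed class SuccessfulParse : IParseResult
{
    public SuccessfulParse(string value) { Value = value; }
    public bool Succeeded { get { return true; } }
    public bool Failed { get { return false; } }
    public string Value { get; private set; }
}

public sealed class FailedParse : IParseResult
{
    private static readonly FailedParse instance = new FailedParse();
    private FailedParse() { }
    public static IParseResult Get { get { return instance; } }
    public bool Succeeded { get { return false; } }
    public bool Failed { get { return true; } }
    public string Value { get { return String.Empty; } }
}

// Hypothetical, cut-down parser state: just a cursor over a string.
public sealed class MiniState
{
    private readonly string text;
    private int position;
    public MiniState(string text) { this.text = text; }
    // Assumption: '\0' signals the end of the input.
    public char Current { get { return position < text.Length ? text[position] : '\0'; } }
    public void Advance() { position++; }
}

public static class FailFastDemo
{
    // Parses "{" digits "}" and returns the digits as the result value.
    public static IParseResult ParseBracedNumber(MiniState state)
    {
        if (state.Current != '{') return FailedParse.Get;
        state.Advance();

        IParseResult result = ParseDigits(state);
        if (result.Failed) return result;      // bail out on the first failure

        if (state.Current != '}') return FailedParse.Get;
        state.Advance();
        return result;
    }

    // The "more specific" parse method: one or more digits.
    private static IParseResult ParseDigits(MiniState state)
    {
        StringBuilder sb = new StringBuilder();
        while (char.IsDigit(state.Current))
        {
            sb.Append(state.Current);
            state.Advance();
        }
        return sb.Length == 0 ? FailedParse.Get : new SuccessfulParse(sb.ToString());
    }
}
```

Calling FailFastDemo.ParseBracedNumber(new MiniState("{42}")) succeeds with the value "42", whereas "{42" and "{}" both fail: the first is missing its closing brace and the second has no digits.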
Let's now parse the header block for an RTF document, at least the very simplified header VS gives us and bearing in mind that we don't really care about much of it, apart from the color table.
    private static IParseResult ParseHeader(ParserState state)
    {
        IParseResult result;
        do
        {
            result = ParseHeaderKeyword(state);
            if (result.Failed)
                return result;
        } while (state.Current != '{');

        do
        {
            result = ParseHeaderGroup(state);
            if (result.Failed)
                return result;
        } while (state.Current != '\\');

        return result;
    }
This method is the first where I really throw away a lot of stuff I'm not interested in. In essence, I make the assumption that there are a bunch of keywords (that is, alphanumeric identifiers preceded by backslashes), followed by a set of header groups (data enclosed in braces). We'll see that one of these header groups is the color table. The header groups are terminated by a backslash, which happens to be the first keyword of the document content. (Please check out the example RTF document here.)
I'll make the point again that I am deliberately simplifying the RTF header to suit my very specific need to convert code in RTF into HTML markup. The header has way more structure than this, but, again, I'm not interested. Don't take this code and apply it to a general word-processing document!
The ParseHeaderKeyword() method is very simple and makes use of a ParseKeyword() method to read the identifier:
    private static IParseResult ParseKeyword(ParserState state)
    {
        StringBuilder sb = new StringBuilder();
        do
        {
            sb.Append(state.Current);
            state.Advance();
        } while (char.IsLetterOrDigit(state.Current));

        if (state.Current == ' ')
            state.Advance();
        return new SuccessfulParse(sb.ToString());
    }

    private static IParseResult ParseHeaderKeyword(ParserState state)
    {
        if (state.Current != '\\')
            return FailedParse.Get;
        state.Advance();
        return ParseKeyword(state);
    }
Essentially: a keyword starts with a backslash, has a bunch of alphanumeric characters, and might have a terminating space that we should skip. I do return the keyword (without the backslash) as part of a successful result.
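As a rough illustration of that rule, here is a standalone sketch that works on a plain string and an index rather than on the parser state. KeywordDemo and ReadKeyword are hypothetical names for this example only, and unlike the method above it throws rather than returning a failed result.

```csharp
using System;
using System.Text;

public static class KeywordDemo
{
    // Reads the keyword that starts at rtf[pos], where rtf[pos] must be the backslash.
    // Returns the keyword without the backslash and moves pos past the keyword
    // and past one delimiting space, if there is one.
    public static string ReadKeyword(string rtf, ref int pos)
    {
        if (pos >= rtf.Length || rtf[pos] != '\\')
            throw new FormatException("Expected a backslash at position " + pos);
        pos++;

        StringBuilder sb = new StringBuilder();
        while (pos < rtf.Length && char.IsLetterOrDigit(rtf[pos]))
        {
            sb.Append(rtf[pos]);
            pos++;
        }

        // A single space after a keyword is a delimiter, not content.
        if (pos < rtf.Length && rtf[pos] == ' ')
            pos++;
        return sb.ToString();
    }
}
```

Given the text \cf1 namespace\cf0 and a starting position of 0, the first call returns "cf1" and leaves the position on the "n" of "namespace", so the delimiting space never reaches the output.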
The ParseHeaderGroup() method is next:
    private static IParseResult ParseHeaderGroup(ParserState state)
    {
        if (state.Current != '{')
            return FailedParse.Get;
        state.Advance();

        IParseResult result = ParseHeaderKeyword(state);
        if (result.Failed)
            return result;

        if (result.Value == "colortbl")
        {
            result = ParseColors(state);
        }
        else
        {
            result = ParseHeaderGroupData(state);
        }

        if (state.Current != '}')
            return FailedParse.Get;
        state.Advance();
        return result;
    }

The group starts with an opening brace and has a keyword that describes the group. If that keyword is colortbl, I need to parse out the colors and create a list to help with parsing the document content. Otherwise I just need to parse the remainder of the group. At the end I expect to see the closing brace, which I can skip.
Let's get parsing the group data out of the way (in essence, I ignore it all):
    private static IParseResult ParseHeaderGroupData(ParserState state)
    {
        StringBuilder sb = new StringBuilder();
        while (state.Current != '}')
        {
            if (state.Current == '{')
            {
                IParseResult result = ParseHeaderGroup(state);
                if (result.Failed)
                    return result;
                sb.Append(result.Value);
            }
            else
            {
                sb.Append(state.Current);
                state.Advance();
            }
        }
        return new SuccessfulParse(sb.ToString());
    }
The only interesting thing about this is that a header group may have another header group embedded in it. (Take a look at the font table in the RTF document.) So I have to make sure I track the opening and closing braces.
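The parser above handles the nesting by recursion. If all you want is to jump over a group you don't care about, a depth counter does the same job. The following is only a sketch of that alternative: GroupSkipDemo and SkipGroup are hypothetical names, the sample text is made up, and the code assumes the group contains no escaped braces (\{ or \}).

```csharp
using System;

public static class GroupSkipDemo
{
    // Returns the index just past the brace that closes the group opening at rtf[start].
    // Assumption: no escaped braces appear inside the group.
    public static int SkipGroup(string rtf, int start)
    {
        if (start >= rtf.Length || rtf[start] != '{')
            throw new FormatException("Expected an opening brace at position " + start);

        int depth = 0;
        for (int i = start; i < rtf.Length; i++)
        {
            if (rtf[i] == '{')
            {
                depth++;            // entering a (possibly nested) group
            }
            else if (rtf[i] == '}')
            {
                depth--;            // leaving a group
                if (depth == 0)
                    return i + 1;   // that was the brace matching rtf[start]
            }
        }
        throw new FormatException("Unbalanced braces");
    }
}
```

For a made-up header fragment such as {\fonttbl{\f0 Consolas;}}{\colortbl;}, calling SkipGroup with a start of 0 returns the index of the brace that opens the color table group.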
The color table in the RTF document consists of a set of colors, each terminated with a semicolon.
    private static IParseResult ParseColors(ParserState state)
    {
        do
        {
            StringBuilder sb = new StringBuilder();
            while (state.Current != ';')
            {
                sb.Append(state.Current);
                state.Advance();
            }
            state.ColorTable.Add(sb.ToString());
            state.Advance();
        } while (state.Current != '}');
        return new SuccessfulParse("");
    }

The method parses each color out as a string in the form \red43\green145\blue175 and calls the ColorTable object's Add() method to add each of them. The delimiting semicolons are jumped over. The method finishes when it sees the final closing brace of the group.
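What the ColorTable does with each of those strings isn't shown in this post, so the following is only a guess at the kind of conversion involved: a sketch that turns one entry into an HTML color string. ColorEntryDemo and ToHtmlColor are hypothetical names. I'm also assuming that an empty entry (as far as I know, RTF color tables usually begin with one, standing for the default color) should map to null.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ColorEntryDemo
{
    // Matches one color table entry such as \red43\green145\blue175.
    private static readonly Regex EntryPattern =
        new Regex(@"^\s*\\red(\d+)\\green(\d+)\\blue(\d+)\s*$");

    // Converts a color table entry to "#rrggbb".
    // Returns null for an empty entry (assumed here to mean "default color").
    public static string ToHtmlColor(string entry)
    {
        if (entry.Trim().Length == 0)
            return null;

        Match match = EntryPattern.Match(entry);
        if (!match.Success)
            throw new FormatException("Unrecognized color entry: " + entry);

        int red = int.Parse(match.Groups[1].Value);
        int green = int.Parse(match.Groups[2].Value);
        int blue = int.Parse(match.Groups[3].Value);
        return String.Format("#{0:x2}{1:x2}{2:x2}", red, green, blue);
    }
}
```

With this sketch, the entry \red43\green145\blue175 becomes #2b91af and \red0\green0\blue255 becomes #0000ff.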
We've now seen all the header parsing methods that we need for our particular application. In essence, we ignore everything except for the embedded color table, which we extract into a list of colors for our own purposes. Now the fun stuff: the document content, the actual colorized code from VS.
The essential process here is to read the content character by character, building an HTML-encoded string as we go. We stop when we get to the closing brace of the entire RTF document. If the current character is not a backslash, we convert it to an HTML entity if needed (so & becomes &amp;amp; for example), and append it to the HTML string. If the current character is a backslash, we may be seeing an escaped character or it may be the start of a keyword. If the former, we encode it if needed and add it to the HTML. If the latter, we need to process that keyword, whatever it may be.
    private static IParseResult ParseDocument(ParserState state)
    {
        StringBuilder sb = new StringBuilder();
        while (state.Current != '}')
        {
            if (state.Current == '\\')
            {
                IParseResult result = ParseDocEscapedChar(state);
                if (result.Succeeded)
                {
                    sb.Append(ConvertEntity(state.Current));
                    state.Advance();
                }
                else
                {
                    result = ParseDocKeyword(state);
                    if (result.Failed)
                        return result;
                    string s = ProcessDocKeyword(result.Value, state);
                    sb.Append(s);
                }
            }
            else
            {
                sb.Append(ConvertEntity(state.Current));
                state.Advance();
            }
        }
        return new SuccessfulParse(sb.ToString());
    }

This method relies on a slew of simple methods that need little to no explanation. First, parsing an escaped character fails if the character after the backslash is alphanumeric (since it would then be a keyword, not an escaped character), and succeeds otherwise. Notice that the method will jump over the backslash.

    private static IParseResult ParseDocEscapedChar(ParserState state)
    {
        state.Advance();
        if (char.IsLetterOrDigit(state.Current))
            return FailedParse.Get;
        return new SuccessfulParse(state.Current.ToString());
    }
The conversion to an HTML entity is pretty simple and might need to be extended if you use other characters in your code that need converting to an entity.
    public static string ConvertEntity(char current)
    {
        switch (current)
        {
            case '&': return "&amp;";
            case '<': return "&lt;";
            case '>': return "&gt;";
            default: return current.ToString();
        }
    }
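If you'd rather not maintain the list of entities by hand, and your version of .NET includes it, System.Net.WebUtility.HtmlEncode() will encode a whole string in one go. This is an alternative to the method above, not what the plug-in uses, and EntityDemo is a hypothetical wrapper name. Note that it encodes quote characters as well as the three characters handled above.

```csharp
using System;
using System.Net;

public static class EntityDemo
{
    // Encodes a fragment of code so that it displays literally in an HTML page.
    public static string Encode(string code)
    {
        return WebUtility.HtmlEncode(code);
    }
}
```

For example, EntityDemo.Encode("if (a < b && c > d)") returns "if (a &amp;lt; b &amp;amp;&amp;amp; c &amp;gt; d)".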
And now, finally we get to the really interesting method, the one that processes a keyword in the content.
Some background first. There are three keywords that we're going to process; all the others are ignored. The three are:
\cfN: set the font color to color N, where N is an index into the color table. \cf0 means "set the font to the default font color".
\cbN: set the background color to color N, where N is again an index into the color table. \cb0 means "set the background to the default background color".
\par: output a new line.
Seems simple enough, but there is a problem with the color keywords.
In essence what I'm going to do is to issue <span> tags around text that is of a different color. So to take an example from our RTF document, I'm going to convert the RTF
\cf1 namespace\cf0
into the HTML
<span style="color: #0000ff">namespace</span>
In other words, replace the "change color to N" keyword with an opening span tag, styled to the correct color, and the "revert to the default color" keyword with the relevant closing tag.
If all the RTF was like this, there would be no problem. However, check out the following code fragment:
return String.Format(
Its RTF version is this:
\cf1 return\cf0 \cf4 String\cf5 .\cf0 Format(
That first color change is the type I've already identified as simple. The second one is worse. In RTF it says: change the font color to 4, output "String", change the font color to 5, output ".", revert to the default font color. In other words, RTF doesn't surround text with begin color, end color pairs — which we'd like to have since that's how HTML works — but acts more like a stream: start this color, output some text, start this other color, output text, start another color, output text, etc, etc. There's essentially no "stop using this color" keyword, although we can use \cf0 for that.
So in the conversion code I have to track whether we're in a "font color span" and in a "background color span". If we are and we receive a color change keyword that is not the "revert to the default" keyword, we have to output a closing span tag before we open up another span tag.
Here's another code fragment:
return String.
which has the following RTF:
{...header stuff...
\fs24 \cf1 return\cf0 \cf4 String\cf5 .}
Notice that there is no "revert to default color" keyword at the end of the content. We're left dangling in "font color 5" mode. No can do in HTML, so that's why in the very top of the parsing tree, I had those extra checks to output </span> tags if they were needed.
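Before the real thing, here is a stripped-down, standalone sketch of just that bookkeeping: font colors only, no HTML encoding, and a made-up color list. FontSpanDemo and ToHtml are hypothetical names. One deliberate difference from my actual code is that a \cf0 arriving when no span is open produces nothing, rather than a stray closing tag.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class FontSpanDemo
{
    // Matches \cfN plus one optional delimiting space.
    private static readonly Regex FontColorKeyword = new Regex(@"\\cf(\d+) ?");

    // colors[N] is the "#rrggbb" value for \cfN; index 0 (the default color) is never read.
    public static string ToHtml(string content, string[] colors)
    {
        StringBuilder html = new StringBuilder();
        bool inSpan = false;
        int pos = 0;

        foreach (Match m in FontColorKeyword.Matches(content))
        {
            // Copy the text that sits between the previous keyword and this one.
            html.Append(content, pos, m.Index - pos);
            pos = m.Index + m.Length;

            int index = int.Parse(m.Groups[1].Value);
            if (inSpan)
                html.Append("</span>");     // end the previous color first
            inSpan = index != 0;
            if (inSpan)
                html.AppendFormat("<span style=\"color: {0};\">", colors[index]);
        }

        html.Append(content, pos, content.Length - pos);
        if (inSpan)
            html.Append("</span>");         // content ended while a color was still active
        return html.ToString();
    }
}
```

Feeding it content that ends in "font color 5" mode, with an invented color list, shows both the close-before-open rule and the final dangling close (note the two spaces after \cf0 in the input: the first is the keyword delimiter, the second is real content):

```csharp
string[] colors = { null, "#0000ff", "#ffffff", "#000000", "#2b91af", "#808080" };
string html = FontSpanDemo.ToHtml(@"\cf1 return\cf0  \cf4 String\cf5 .", colors);
// html is now:
// <span style="color: #0000ff;">return</span> <span style="color: #2b91af;">String</span><span style="color: #808080;">.</span>
```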
Having provided the background, it should be easy to understand the rather complicated, somewhat unrefactored ProcessDocKeyword() method:

    private static string ProcessDocKeyword(string keyword, ParserState state)
    {
        int colorIndex;
        Color color;
        string format;

        Regex regex = new Regex(@"([a-z]+)(\d*)");
        Match match = regex.Match(keyword);
        string keywordBase = match.Groups[1].Value;

        switch (keywordBase)
        {
            case "cf":
                colorIndex = int.Parse(match.Groups[2].Value);
                if (colorIndex == 0)
                {
                    state.InFontColorSpan = false;
                    return "</span>";
                }
                color = state.ColorTable[colorIndex];
                format = "<span style=\"color: #{0:x2}{1:x2}{2:x2};\">";
                if (state.InFontColorSpan)
                    format = "</span>" + format;
                state.InFontColorSpan = true;
                return String.Format(format, color.R, color.G, color.B);

            case "cb":
                colorIndex = int.Parse(match.Groups[2].Value);
                if (colorIndex == 0)
                {
                    state.InBackgroundSpan = false;
                    return "</span>";
                }
                color = state.ColorTable[colorIndex];
                format = "<span style=\"background-color: #{0:x2}{1:x2}{2:x2};\">";
                if (state.InBackgroundSpan)
                    format = "</span>" + format;
                state.InBackgroundSpan = true;
                return String.Format(format, color.R, color.G, color.B);

            case "par":
                return Environment.NewLine;
        }
        return String.Empty;
    }
And that's it: converting an RTF document from a copy operation of some selected code in Visual Studio to HTML that we can then paste into Windows Live Writer through a plug-in.
There are bound to be some bugs in this, but it works for how I use the syntax highlighting colors in the VS editor. For example, if you're a fan of the darker, more restful color themes that have colored backgrounds, such as those here, it fails:
    namespace RtfToHtml
    {
        // The result of a parse operation
        public interface IParseResult
        {
            bool Succeeded { get; }
            bool Failed { get; }
            string Value { get; }
        }
    }

I can see from this that the background span is being closed off and not the font color span. I'm guessing that my simple "in font color span" and "in background span" bools won't cut it and you'll possibly have to output a single span that includes both the foreground and background colors. Mind you, in this case, I think the whole enclosing div should be output with the background color rather than the actual text, as I have here. Another feature, another day.
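For what it's worth, here is an untested-in-the-plug-in sketch of how that combined span might go: remember the current foreground and background colors, and whenever either changes, close the open span and open a new one that carries both. CombinedSpanWriter and its members are hypothetical names, colors are passed as ready-made "#rrggbb" strings, and null stands for "default".

```csharp
using System;
using System.Text;

public sealed class CombinedSpanWriter
{
    private string foreground;   // null means the default font color
    private string background;   // null means the default background color
    private bool spanOpen;

    // Each setter returns the markup to emit for that color change.
    public string SetForeground(string color) { foreground = color; return Reopen(); }
    public string SetBackground(string color) { background = color; return Reopen(); }

    // Call once at the end of the content to close any span still open.
    public string Close()
    {
        if (!spanOpen) return String.Empty;
        spanOpen = false;
        foreground = null;
        background = null;
        return "</span>";
    }

    // Closes the current span (if any) and opens one reflecting both colors.
    private string Reopen()
    {
        StringBuilder sb = new StringBuilder();
        if (spanOpen)
            sb.Append("</span>");

        spanOpen = foreground != null || background != null;
        if (spanOpen)
        {
            sb.Append("<span style=\"");
            if (foreground != null)
                sb.Append("color: ").Append(foreground).Append(';');
            if (foreground != null && background != null)
                sb.Append(' ');
            if (background != null)
                sb.Append("background-color: ").Append(background).Append(';');
            sb.Append("\">");
        }
        return sb.ToString();
    }
}
```

Because there is only ever one span open at a time, the closing tags can't get out of step with the opening ones, which is the failure I'm seeing with the two separate bools. The cost is more markup: every color change closes and reopens the span.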










