PCPlus 258: Parsing comma-separated values

I write a monthly column for PCPlus, a computer news-views-n-reviews magazine in the UK (actually there are 13 issues a year, since there's an Xmas issue as well, so it's a bit more than monthly). The column is called Theory Workshop and appears in the back of every issue. When I signed up, my editor and the magazine were gracious enough to allow me to reprint the articles here after, say, a year or so. After all, the PDFs do appear on each issue's DVD after a couple of months.

Onto number four in my ongoing set of articles for PCPlus. This time I wanted to talk about state machines, but instead it turned into a discussion about Comma-Separated Values (CSV) files. Unfortunately the strap called them CSU files, but hey-ho.

The rationale for this is a series of prior blog posts and such where I'd investigated and expanded on what it would take to parse the CSV format. These prior investigations turned into a Java implementation when someone converted my C# library. Unlike before when I started out from the language grammar, this time my angle of attack was from the state machine diagram. It's always easier to draw a diagram than to work out the BNF for some grammar.

Unfortunately, this was yet another attempt where I tried to show code in the article. I hadn't yet seen issue 257 where this had proved a failure (I write articles about three months in advance, and Barnes & Noble get copies maybe three weeks after they appear in the UK, so it's almost four months from sending off a zip to me reading the published end-result), and so I continued thinking that there was no problem.

So, download the PDF and follow along with the code displayed here. First of all the actual state machine:

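  using System;  // Exception, Console and String in these listings live here

  // A state consumes one character and returns the state to move to next.
  public interface IState {
    IState Process(char ch);
    // True if the input may legally end while the machine is in this state.
    bool IsTerminator { get; }
  }

  public static class CsvStateMachine {
    // Feeds the text through the state machine one character at a time.
    public static void Execute(string text, IState startState) {
      IState currentState = startState;
      foreach (char c in text) {
        currentState = currentState.Process(c);
      }
      if (!currentState.IsTerminator)
        throw new Exception("Done parsing, final field is not complete");
      // The last field has no trailing comma, so emit it here.
      FieldProcessor.Finish();
    }
  }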

The FieldProcessor class:

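  // Accumulates the characters of the current field and prints each
  // completed field in square brackets.
  public static class FieldProcessor {
    private static string field = String.Empty;
    public static void AddChar(char c) {
      field += c;
    }
    // Prints the current field and resets the buffer for the next one.
    public static void Finish() {
      Console.WriteLine('[' + field + ']');
      field = String.Empty;
    }
  }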

And finally the first state as a class:

  public class FieldStartState : IState {
    public IState Process(char ch) {
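      switch (ch) {
        case ',':
          // A comma straight away means an empty field.
          FieldProcessor.Finish();
          return this;
        case '"':
          // An opening quote starts a quoted field.
          return new ScanQuotedFieldState();
        case ' ':
          // Skip leading spaces.
          return this;
        default:
          // Anything else is the first character of an unquoted field.
          FieldProcessor.AddChar(ch);
          return new ScanFieldState();
      }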
    }

    public bool IsTerminator {
      get { return true; }
    }
  }

Bizarrely, I cannot now find the actual solution from which this code is gathered. What you see here was copied from my original Word doc, where it is nicely syntax-highlighted, so I must have had the solution in Visual Studio at some point.

The article first appeared in issue 258, August 2007.

You can download the PDF here.

UPDATE: (about half an hour later) I've recreated the code for the remaining states:

  internal class ScanQuotedFieldState : IState {
    public IState Process(char ch) {
      switch (ch) {
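        case '"':
          // Either the closing quote or the first half of a doubled quote;
          // TerminateFieldState decides which from the next character.
          return new TerminateFieldState();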
        default:
          FieldProcessor.AddChar(ch);
          return this;
      }
    }

    public bool IsTerminator {
      get { return false; }
    }
  }

  internal class ScanFieldState : IState {
    public IState Process(char ch) {
      switch (ch) {
        case ' ':
          // A space ends an unquoted field.
          return new TerminateFieldState();
        case ',':
          FieldProcessor.Finish();
          return new FieldStartState();
        default:
          FieldProcessor.AddChar(ch);
          return this;
      }
    }

    public bool IsTerminator {
      get { return true; }
    }
  }

  internal class TerminateFieldState : IState {
    public IState Process(char ch) {
      switch (ch) {
        case ' ':
          return this;
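        case '"':
          // A second quote means the previous one was doubled: keep one
          // quote character and resume scanning the quoted field.
          FieldProcessor.AddChar(ch);
          return new ScanQuotedFieldState();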
        case ',':
          FieldProcessor.Finish();
          return new FieldStartState();
        default:
          throw new Exception("Invalid character after field was terminated.");
      }
    }

    public bool IsTerminator {
      get { return true; }
    }
  }

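This shows the version where quotes inside quoted fields are, er, double quoted, if that makes sense: a literal " inside a quoted field is written as "". TerminateFieldState is where that gets handled. When a quote arrives after what looked like the closing quote, one quote character is added to the field and scanning of the quoted field resumes.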

Finally here's the method that kicks off the parser:

  public static class CsvStateMachine {
    //...
    public static void Parse(string text) {
      Execute(text, new FieldStartState());
    }
  }
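
For anyone who wants to try the transitions without wiring up all the classes, here is a minimal self-contained sketch of the same four states, collapsed into an enum and a switch. It is not the code from the article: the names CsvSketch, State and ParseLine are made up for illustration, it returns the fields in a list instead of printing them through FieldProcessor, and it throws FormatException where the listings above throw Exception. It assumes a single line of input, as the listings do.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical names throughout; none of these come from the article.
public static class CsvSketch {
  private enum State { FieldStart, ScanField, ScanQuotedField, TerminateField }

  // Splits one line of CSV text into fields, following the same
  // transitions as the four state classes shown above.
  public static List<string> ParseLine(string text) {
    var fields = new List<string>();
    var field = new StringBuilder();
    State state = State.FieldStart;

    foreach (char ch in text) {
      switch (state) {
        case State.FieldStart:
          if (ch == ',') { fields.Add(field.ToString()); field.Clear(); }
          else if (ch == '"') state = State.ScanQuotedField;
          else if (ch != ' ') { field.Append(ch); state = State.ScanField; }
          break;  // a leading space falls through all three tests and is skipped

        case State.ScanField:
          if (ch == ' ') state = State.TerminateField;
          else if (ch == ',') {
            fields.Add(field.ToString()); field.Clear();
            state = State.FieldStart;
          }
          else field.Append(ch);
          break;

        case State.ScanQuotedField:
          if (ch == '"') state = State.TerminateField;
          else field.Append(ch);  // commas and spaces are ordinary characters here
          break;

        case State.TerminateField:
          if (ch == '"') {
            // Doubled quote: keep one quote and go back to the quoted field.
            field.Append(ch);
            state = State.ScanQuotedField;
          }
          else if (ch == ',') {
            fields.Add(field.ToString()); field.Clear();
            state = State.FieldStart;
          }
          else if (ch != ' ')
            throw new FormatException("Invalid character after field was terminated.");
          break;
      }
    }

    // ScanQuotedField is the only state the input may not end in.
    if (state == State.ScanQuotedField)
      throw new FormatException("Done parsing, final field is not complete");
    fields.Add(field.ToString());
    return fields;
  }

  public static void Main() {
    // The input line is: one, "two, too", "say ""hi"""
    foreach (string f in ParseLine("one, \"two, too\", \"say \"\"hi\"\"\""))
      Console.WriteLine("[" + f + "]");
  }
}
```

Run as written, this prints [one], [two, too] and [say "hi"] on separate lines. The enum version is shorter, but it loses the property of the class-based design that each state's behaviour sits in its own type.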

 


About Me

I'm Julian M Bucknall, the M because it's my middle initial and because I and the other Julian Bucknall (the movie guy) would like to differentiate ourselves.

I'm a programmer by trade, an actor by ambition, and an algorithms guy by osmosis. I write articles for PCPlus in my spare time, not that there's much of that.

Apart from that, an ex-pat Brit, atheist, microbrew enthusiast, Pet Shop Boys fanboy, slide rule and HP calculator collector, amateur photographer, Altoids muncher.

DevExpress

I'm Chief Technology Officer at Developer Express, a software company that writes some great controls and tools for .NET and Delphi. I'm responsible for the technology oversight and vision of the company.
