Parsing and Grammars - Part 4

January 16th, 2009

Abstract Syntax Tree

In part 2 of this series you saw that a ParserFunc takes source code and returns a ParseResult consisting of a tree representation of the matched source code and the rest of the source code which could not be matched:

public class ParseResult

{

public IAst Ast;

public Source Rest;

}

Since in my last post I already presented the Source class, there is only one thing left to explain — the tree representation of the source code. Lets imagine we want to parse the simple expression:

2 + 3 * 5

One possible tree representation of this expression might be

Example AST

This tree encodes the content and the structure of the source code, e.g. the precedence of * over +.

I will start out with a very simple implemention of abstract syntax trees which I will refine as I go along with my project. The two main ideas are

  1. use an interface to define the basic behavior of nodes (so we have more freedom in plugging in various node types later on and are not forced to include all AST nodes in a single class hierarchy)
  2. nodes are immutable (to play nice with a functional programming style)

Interface IAst

The following implementation reflects these design choices:

/// <summary>

/// Interface for nodes in an abstract syntax tree (AST).

/// </summary>

public interface IAst

{

/// <summary>

/// Gets source code corresponding to this node.

/// </summary>

Source Source { get; }

/// <summary>

/// Gets index-th child of this node.

/// </summary>

IAst this[int childIndex] { get; }

/// <summary>

/// Enumerates all children of this node.

/// </summary>

IEnumerable<IAst> Children { get; }

}

For a start I defined 4 concrete node types which should be sufficient for some first steps.

  1. AstNode — stores source code corresponding to this node, nothing else (no children)
  2. AstValue<T> — stores a value of type T and associated source code
  3. AstBinaryOp —stores the operator name and references ASTs representing the left and right operands
  4. AstGroup —stores an arbitrary number of child nodes and associated source code

The following sections present the source code of these classes.

AstNode

public class AstNode : IAst

{

private Source m_source;

public AstNode(Source source) { m_source = source; }

#region IAst Members

public Source Source

{

get { return m_source; }

}

public virtual IAst this[int index]

{

get { throw new IndexOutOfRangeException(); }

}

public virtual IEnumerable<IAst> Children

{

get { yield break; }

}

#endregion

}

AstValue<T>

public class AstValue<T> : AstNode

{

public readonly T Value;

public AstValue(T value, Source source)

: base(source)

{

Value = value;

}

public override string ToString()

{

return Value.ToString();

}

}

AstBinaryOp

public class AstBinaryOp : AstNode

{

public readonly IAst Left;

public readonly IAst Right;

public readonly string Op;

public AstBinaryOp(string operatorName, IAst left, IAst right, Source src)

: base(src)

{

Op = operatorName;

Left = left;

Right = right;

}

public override IAst this[int index]

{

get

{

if (index == 0) return Left;

else if (index == 1) return Right;

else throw new IndexOutOfRangeException();

}

}

public override IEnumerable<IAst> Children

{

get { yield return Left; yield return Right; }

}

public override string ToString()

{

return Op;

}

}

AstGroup

public class AstGroup : AstNode

{

private IAst[] m_items;

public AstGroup(IEnumerable<IAst> items)

: base(Source.Union(items.Select(x => x.Source)))

{

m_items = items.ToArray();

}

public AstGroup(params IAst[] items)

: this(items.AsEnumerable())

{  }

public override IAst this[int index]

{

get { return m_items[index]; }

}

public override IEnumerable<IAst> Children

{

get { return m_items; }

}

}

In my next post I plan to present the basic parser functions (Part 5), and then we can finally define a simple grammar and start parsing some source code (Part 6). I also plan to post the complete source code starting with Part 6.

Language , , ,

Parsing and Grammars - Part 3

January 10th, 2009

Before we take a look at various basic parsers and parser combinators as promised in my previous post, I would like to spend some time on source code representation (this post) and Abstract Syntax Trees (next post).

Source Code Representation

Lets quickly revisit the definition of a parser function. Remember, that such a function takes source code and returns a ParseResult consisting of the structure it matched and the rest of the source code it could not match — or null if it failed to match anything.

public delegate ParseResult ParserFunc(Source source);

Basically, source code is nothing more than a string. But, lets asume we have a parser function that matches a digit from ‘0′ to ‘9′, and our source looks like this:

34521, 344543, 7687, 45467, 3234, 4546, 284, …

If we apply our parser function it will match and return the first character ‘3′ since it is a digit, but it will also return the rest (shown in red)

4521, 344543, 7687, 45467, 3234, 4546, 284, …

and this rest could be potentially a very long string itself. So we would create an almost identical copy of the original string, which is very inefficient. Since, on the lowest level, we always match single characters or short strings (e.g. keywords) we always have this same situation. So if we look at the worst case scenario, an original piece of source code of length 10000 would generate rest strings of lengths 9999, 9998, 9997, …, 0. The summed length would be 50 000 000. Although we would never need all these strings in memory at once, it nevertheless creates a lot of overhead to simply construct all these strings, copy millions of characters, and puts a lot of pressure on the garbage collector.

In order to avoid this problem, I created a Source type, which basically stores a string reference and the valid range within this string. Thus it is possible that multiple (or all) Source objects reference the same source string (no copying!). The memory footprint of one Source instance is small, consisting of 1 reference to the source string, and 2 integers for the valid range within this string. Furthermore, the Source type is a struct, so no pressure is put on the garbage collector at all.

Finally, as you can see in the following code, the Source type does not really reference a string, but an array of SourceChar values. The idea behind this is, that later on, it would be really nice if I could identify the exact location of each element in the AST with the corresponding location within the original source code (line numbers and columns). This will help to generate useful error messages if a Source contains syntax errors, as well as with debugging the parser itself (and it only adds a constant overhead to the size of the source code).

Struct Source

public struct Source

{

public static readonly Source Empty = new Source(null, 0, 0);

public static readonly Source Invalid = new Source(null, -1, -1);

private SourceChar[] m_chars;

private int m_offset;

private int m_count;

public Source(string sourceCode)

{

m_chars = sourceCode.AsSourceChars().ToArray();

m_offset = 0;

m_count = m_chars.Length;

}

public bool IsEmpty { get { return m_count == 0; } }

public int Length { get { return m_count; } }

public SourceChar this[int index]

{

get

{

if (index < 0 || index >= m_count)

throw new IndexOutOfRangeException();

return m_chars[m_offset + index];

}

}

public Source Take(int count)

{

return new Source(m_chars, m_offset, Math.Min(count, m_count));

}

public Source Skip(int count)

{

int a = Math.Min(m_offset + count, m_offset + m_count);

int b = Math.Max(m_count - count, 0);

return new Source(m_chars, a, b);

}

public Source SkipWhitespace()

{

int count = 0;

while (count < Length && char.IsWhiteSpace(this[count].Value))

count++;

if (count == 0) return this;

int a = Math.Min(m_offset + count, m_offset + m_count);

int b = Math.Max(m_count - count, 0);

return new Source(m_chars, a, b);

}

public override string ToString()

{

return new string(

m_chars.Skip(m_offset).Take(m_count).Select(c => c.Value)

.ToArray());

}

public static Source Union(IEnumerable<Source> sources)

{

if (sources == null) throw new ArgumentNullException();

if (sources.Count() == 0) return Source.Empty;

// check if all sources refer to the same underlying source

if (sources.Where(s => !s.IsEmpty).GroupBy(x => x.m_chars).Count() > 1)

throw new ArgumentException(

“Called Source.Union on different sources.”);

int start = sources.Select(x => x.m_offset).Min();

int end = sources.Select(x => x.m_offset + x.m_count).Max();

return new Source(sources.First().m_chars, start, end - start);

}

private Source(SourceChar[] chars, int offset, int count)

{

m_chars = chars; m_offset = offset; m_count = count;

}

}

Struct SourceChar

public struct SourceChar

{

public char Value;

public int Line;

public int Column;

}

AsSourceChars Extension

public static partial class Extensions

{

public static IEnumerable<SourceChar> AsSourceChars(this string s)

{

if (s == null) yield break;

int line = 1;

int col = 1;

foreach (var c in s)

{

if (c == ‘r’) continue;

if (c == ‘n’) { line++; col = 1; continue; }

yield return new SourceChar

{ Value = c, Line = line, Column = col++ };

}

}

}

Language , , ,

Parsing and Grammars - Part 2

January 6th, 2009

In my last post I showed some examples and sketches showing a BNF-like specification of grammars inside C#. At the end I promised to describe the implementation of this internal DSL.

Lets revisit the following code snippet (I substituted the var keywords with the actual type):

Parser DIGIT = P.Char(char.IsDigit);

Parser DIGITS = P.OneOrMore(DIGIT);

A parser is basically a function that takes some source code and tries to match a specific structure at the beginning of this source code. If it fails, it returns null. If it succeeds it returns a node (or tree) representation of the matched structure (a so-called Abstract Syntax Tree or AST), and it also returns the rest of the source code which it could not match.

/// <summary>

/// A ParserFunc is a function that takes a source,

/// matches some structure at its beginning, and

/// returns a ParseResult consisting of a tree

/// representation of the matched structure and

/// the rest of the source.

/// </summary>

public delegate ParseResult ParserFunc(Source source);

/// <summary>

/// Return value of a ParserFunc, consisting of

/// a tree representation of the matched structure,

/// and the rest of the source code (that was not

/// matched).

/// </summary>

public class ParseResult

{

public IAst Ast;

public Source Rest;

}

So now we have a ParserFunc, but still no Parser - what’s the problem? — The problem is, that I would like to use custom operators for a very succinct syntax, e.g.

var INTEGER =

(MINUS | PLUS) & DIGITS

| DIGITS

;

instead of something like

var INTEGER =

((MINUS.Or(PLUS)).And(DIGITS))

.Or(DIGITS)

;

The latter we can achieve easily using extension methods on the delegate type, e.g.

public static class ParserFuncExtensions

{

public static ParserFunc Or(this ParserFunc f, ParserFunc g)

{ throw new NotImplementedException(); }

public static ParserFunc And(this ParserFunc f, ParserFunc g)

{ throw new NotImplementedException(); }

}

but unfortunately it is not possible to define custom operator overloads for delegate types, hence we need a workaround. Wrapping a ParserFunc inside a class will do the trick, because now we can define custom operators for this class.

public class Parser

{

public ParserFunc Func { get; set; }

public static Parser operator |(Parser a, Parser b)

{

return new Parser(a.Func.Or(b.Func));

}

public static Parser operator &(Parser a, Parser b)

{

return new Parser(a.Func.And(b.Func));

}

}

The implementation of Or is really simple — try the first function, and if it fails try the second one.

public static ParserFunc Or(this ParserFunc f, ParserFunc g)

{

return source => f(source) ?? g(source);

}

We just created a function, that takes two functions as input and returns another function.

The implementation of And is only slightly more complex. We need to try the first function, and if it fails, then the complete And fails. If the first function succeeds, then we have to try the second function on the rest. If it fails, then again the complete And fails. If the second function succeeds, then we have to combine the results of both functions and we are done.

public static ParserFunc And(this ParserFunc f, ParserFunc g)

{

return source =>

{

// try f, and if it fails, then And fails

var fResult = f(source);

if (fResult == null) return null;

// try g on the rest, and if it fails, then And fails

var gResult = g(fResult.Rest);

if (gResult == null) return null;

// combine the results of f and g

return new ParseResult

{

Ast = AstNode.Group(fResult.Ast, gResult.Ast),

Rest = gResult.Rest

};

};

}

Now that we have our basic machinery in place we can start fleshing out some missing details like repetitions (zero-or-more, one-or-more), optional elements, and concrete parsers like P.Char used in preceding examples. This already sounds very much like a tagline for the next post, so stay tuned.

Language , ,

Parsing and Grammars - Part 1

January 5th, 2009

Lets start with a very simple grammar: expressions built from integers, +, -, *, /, and ().

Examples:

  • 123
  • -17
  • 3+5
  • 10*((20-5)/3)

When defining the grammar, I would like to write something like this in C#,

var INTEGER =

(MINUS | PLUS) & DIGITS

| DIGITS

;

where MINUS, PLUS, and DIGITS are variables representing parsers, which are combined into a more complex parser named INTEGER using custom overloaded & and |operators.

The |operator tries the first parser and if it fails uses the second one (alternate). The &operator uses the first parser and then the second parser on what is left (sequence).

So the example above specifies that an INTEGER is

  • either a MINUS or a PLUS character followed by DIGITS,
  • or just DIGITS

and looks very much like a BNF-Grammar - but INSIDE of C#!

Of course there are limits. It would be nice to write something like FOO+ to denote 1 or more repetitions of FOO, but clearly this is not possible since there is no unary postfix +operator in C#.

It would also be nice to directly embed literals inside the grammar rules, e.g.

var INTEGER =

(‘-’ | ‘+’) & DIGITS

| DIGITS

;

but this also is not possible because when defining custom operator overloads at least one operand has to be of the type of the class in which the overload is defined in.

As a workaround I specified a static class P (as in Parser) containing a number of helper methods that create parsers:

public static class P

{

// Character literal.

public static Parser Char(char value) { … }

public static Parser Char(Func<char, bool> predicate) { … }

// String literal.

public static Parser Literal(string value) { … }

// Alternate (try the first parser, if it fails try the next, and so on).

// This is the n-ary variant of the |operator.

public static Parser OneOf(params Parser[] parsers) { … }

// Sequence (use the first parser, then the next on the rest, and so on).

// This is the n-ary variant of the &operator.

public static Parser Seq(params Parser[] parsers) { … }

// Repetition. Use the specified parser one or more times (greedy).

public static Parser OneOrMore(Parser parser) { … }

}

Now we can define the DIGITS parser used above,

var DIGIT = P.Char(char.IsDigit);

var DIGITS = P.OneOrMore(DIGIT);

where a DIGIT is a character for which the method char.IsDigit returns true, and DIGITS are simply one ore more occurences of DIGIT.

So, what is this Parser type, and how exactly did I implement all of this? Stay tuned for the next post!

Language , ,

Implementing a Programming Language

January 3rd, 2009

As a side project I will try to implement a simple programming language using C# and the DLR (Dynamic Language Runtime). The general idea is to create the parser and all the necessary plumbing in C# and to use the DLR as the backend runtime. This way I will not have to deal with low-level IL and can focus on the language instead. Since I also like to play around with functional-style programming the parser should be based somewhere along the lines of parser combinators and allow for a very declarative style of defining the language without the need to use 3rd-party parser generators. Stay tuned for the next post which will describe a first sketch of the parser.

Language ,