A little over a year ago I wrote a post about Roslyn.
Roslyn is a Microsoft project to rewrite the VB and C# compilers in their respective languages. After Mads talk, and listening to the (proposed) new features for C# 6 I couldn’t help but feel a little underwhelmed with the direction of C#.
Mads’ justification seems reasonable on the face of it – C# 2, 3,4 & 5 have all had major new language additions (namely generics, linq, dynamic & async) and that C# 6 is more of a ‘tidy the pain points’ release. But I had a number of ideas for improvements, and unfortunately only two are in C#6 (primary constructor and read-only auto-properties).
To make matters worse, it is not one of the goals of Roslyn to allow language extensions – C# spec is still owned by the C# team, and Roslyn wont allow you to add your own syntax. I get this – it would be fairly terrible if every codebase was C# + custom changes. But I can understand the desire from users to at least open source Roslyn (especially as it doesn’t run on Mono according to Glenn Block of ScriptCS fame – although I haven’t verified this myself).
It was this desire to play with language features that MS are not going to implement that drew me to writing my own C# compiler.
Before we begin
Phil Trelford’s excellent series on writing a SmallBasic compiler has to be the the best intro to building a compiler in .NET today – it covers parsing and compiling actual code from SmallBasic. Interestingly we had the similar idea at the same time about writing a parser & compiler for C#.
Now, I didn’t do any compiler theory in uni, I’ve never written a compiler before and I’ve only ever looked at Reflection Emit in anger. But despite these limitations I have a fairly capable C# 0.5 compiler (in that it is a subset of the C# 1 compiler). Which means you can do it too – it took me less than a month.
The source code for the compiler is here. It currently supports compilation of
- Local Variables
- Return values
- for, while and do while loops
- Comparisons and arithmetic
- Method calls (static, instance and “this”, on built in and custom defined types)
- Constructor calls
- Fields (no initialization)
- Custom values – always 0 based
- Empty Structs
Big things currently missing (most are trivial plumbing, but I’ve just not had time or inclination yet) –
- Foreach, lock, switch, if/else
- Full range of modifiers (ie static, virtual, sealed, etc)
- Struct bodies
- Constructor Bodies
- Field initialization
But even with those limitations it’s still possible to write some basic programs.
Given C# is a (fairly) rigid spec, and this is a small version I’ve allowed myself some liberties:
- There is code that should compile that doesn’t (even with above limitations).
- There is code that shouldn’t compile that does.
- As this is my first attempt at a C# Abstract Syntax Tree (Abbr AST) there are bits that are (in retrospect) plainly wrong, but work OK for now. Phil’s C# AST is more robust and at some point I’ll switch to using his.
- You can only compile 1 file at a time.
- Performance is pretty weak (of the compiler). Given the above limitation it’s not a big deal, but you wouldn’t want to build entity framework with it (courtesy of F# for fun and profit).
- There are entirely unimplemented features – some I will address, some I wont (switch for example is so weak in C# it doesn’t feel worth the effort).
- Error messages are virtually non-existent – most compile errors are handled with a “failwith” which will simply throw an exception.
- It can only compile DLLs currently.
I don’t want to go into too much detail about specifics of the code here – the full source is available if you want it, and you can ask me questions in the comments, should you feel the need. What I’d like to talk about is some of the conceptual issues from transforming a piece of text, into a loose structure, into a tighter structure, into IL.
Phil’s blog post about parsing C# is an excellent start of turning C# code into F# Discriminated unions. I took a slightly different direction of using fslex and fsyacc to transform plain text into code, but the basic principle is the same – Discriminated Unions are great for describing Syntax Trees.
Both approaches to parsing have pros and cons. FParsec has some clear advantages of fslex/fsyacc – error messages are much better and you can easily use it from fsi while Fslex/fsyacc require a pre-compile step and are more verbose, but (to me) have an easier to learn syntax and a less steep learning curve. A more detailed comparison of fparsec vs parser-generator can be found here.
Transforming an AST into IL
Once we have a description of the code structure loaded into an AST the fun begins. There are a number of problems solved by the C# compiler that are really convenient for the developer, but are tricky for the compiler writer.
- No header files means that the compiler must decide the optimal path for generating types (unlike languages with header files where the public type definition must be predefined – by the programmer).
- Mutually recursive types means that you cant simply compile in the order the code is presented.
- Void as a type/not a type (Void exists in the system namespace, yet it doesn’t make sense as a type in any other place than a method declaration)
- Null as a thing, but not really a thing.
You can think of the AST loaded from a text file as a loose representation of the code. You can write a completely valid AST that is utterly nonsensical from an execution point of view (for example calling non-existent methods, using undeclared values, type-mismatches, etc). One of the 1st tasks we need to do with this loose structure is to build it into a strongly typed and validatable AST. In the source I’ve simply gone from Expr to TExpr (where T stands for Typed).
To be able to build a typed representation we first need to stub out the type (class, interface or struct). This is because we may have a Method that returns an instance of a class defined later in the code. By “stubbing” the types we can stash them away to look up later.
Once we have stubbed the types, we need to do the same thing with the body of a type – a method may depend on a field, a property or another method, so we define the stub for methods ready to fill in later, but keep the method stub in a structure (in fact a tuple of method builder and a function to build it’s body) – think function prototypes in C. By the end of this stage we have a complete list of all user defined types, and members of those types, but importantly, there is no actual *code* emitted – purely code structure.
The “builder” function returned then converts the AST to a “Typed” AST, which is then validatable – I currently skip validation and skip to IL Generation. But it is now rich in metadata about the code – metadata which is used to convert the Typed AST into CIL.
The premise is actually deceptively simple – the devil is in the details.
- The Reflection.Emit APIs do inheritance patently wrong – this cost me about 2 days of development time (although had I RTFM first I could have avoided this)
- TypeBuilder derives from Type, so exposes Type’s methods, but they fall over at runtime unless you call CreateType first, which prevents further modification. The same problems apply to MethodBuilder (derives from MethodInfo), ConstructorBuilder (derives from ConstructorInfo), FieldBuilder (derives from FieldInfo) and PropertyBuilder (derives from PropertyInfo)
- This means any userdefined types must be stored seperately from referenced assembly types for appropriate lookup on methods, fields, etc, rather than calling existing methods. F# really helps keep the ceremony down as you can partially apply functions pre-loaded with these definitions, rather than passing lots of data-structures around.
- IL Generation is fairly finnicky – for example to call a method on a struct local variable you must load the struct using ldloca instead of the usual ldloc.
- F# Pattern matching makes light work of handling these cases – once you know they exist.
- The entire IL generator code which covers most use cases is ~150LOC. This is the most complex file in the project, but is hardly difficult.
People seem to be waiting for Roslyn to solve either meta-programming or scripting problems they either have now or think they’ll have in the future. The excellent .NET Rocks podcast has had a lot of people on recently to talk about what they’ve either done with Roslyn or plan to do, and it feels like a lot of it is entirely plausible without Roslyn.
The spirit of this post is to encourage you to not wait and to build it yourself. You probably don’t need the full C# 5 experience for a scripting language in your app. Rather than wait (Roslyn still has no official release date) I want to show you that it is not only possible, but entirely feasible to do this right now with a little thought as to what you are going to achieve – the provided source provided weighs in around only 800 lines of code so it’s not heavyweight by any stretch. The worst case scenario is that you rewrite your feature using Roslyn (once it is released)/Mono C# Compiler/CSharpCodeProvider, and you have some fun and learn something new in the meantime. Which doesn’t seem like such a bad case. Remember kids – this is the year of code – time to stop procrastinating and write something!
F# is a powerful choice for writing a compiler – Discriminated Union types, Record-Types, pattern-matching, immutable-by-default (the code provided is almost purely immutable – I cheated in 1 place, with a generic dictionary) and 1st class function and tuples makes light work of defining a syntax tree and transforming it into executable IL code. In fact I can’t begin to imagine how painful it would be to write a compiler without these features……