Today I want to write about something I worked on ages ago. Back in March, I wanted to see whether I could extend a Java compiler to support LINQ expressions as well.

I probably spent more time finding a good open-source compiler to experiment with than I later spent trying things out, so let me share my preferred source with you: http://openjdk.java.net/ is a good place to start. More specifically, http://openjdk.java.net/groups/compiler/ contains valuable information about the way the compiler works. A nice thing is that there is a branch that adds support for ANTLR, which makes adding new language features a bit easier because you get to change a grammar file instead of having to tweak hand-written lexers and parsers. More information about it can be found at http://openjdk.java.net/projects/compiler-grammar/. You can download the source code from http://hg.openjdk.java.net/. Don’t follow the link to http://hg.openjdk.java.net/compiler-grammar/compiler-grammar, though: that one only lets you download part of the branch (and nothing interesting either, which was very frustrating at the beginning).

I never got around to adding support for LINQ in the end, but to get to know the compiler and the ANTLR grammar, I added support for two C# features: the var keyword, which enables automatic type deduction, and anonymous objects (again using the C# syntax). With my changes, the following compiles and executes correctly:

public class Test {
    public static void main(String[] args) {
        // automatic type deduction
        var t = Math.atan(1);
        System.out.println( t );

        // anonymous type
        var i = new { Amount = 108, message = "hello" };
        System.out.println( i.Amount );
    }
}

### Automatic Type Deduction

Let’s take a look at how I added support for the var keyword. A var declaration requires an initializer; the compiler deduces the type of that initializer and uses it as the variable’s type.

This is not a feature I’d recommend using in general, because it obscures a variable’s type[[:obviously…]] and makes the code harder to read and understand. In conjunction with LINQ and anonymous types, however, it is very useful, because you don’t want to have to know the type of the query[[:and you actually can’t if the result is based on an anonymous type]].
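
To make the deduction concrete, here is a small sketch in plain Java (no patched compiler required) of what the declaration var t = Math.atan(1); from the test program is effectively treated as: the static type of the initializer, double in this case, becomes the declared type. The class and method names are made up for illustration.

```java
// Hypothetical demo class; it compiles with a stock javac.
class DeducedTypeDemo {
    // What the patched compiler accepts:   var t = Math.atan(1);
    // What it is effectively treated as:   double t = Math.atan(1);
    static double deducedExample() {
        double t = Math.atan(1); // Math.atan returns double, so double is the deduced type
        return t;
    }

    public static void main(String[] args) {
        System.out.println(deducedExample());
    }
}
```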

The nice thing about ANTLR is that it’s... nice. I’m going to post the diff from my repository to show how easy it was to add support for the var keyword in the grammar file.

It originally looked like this:

localVariableDeclaration returns [com.sun.tools.javac.util.List list]
    @init {
        [...]
        }
    @after {
        [...]
        }
    : variableModifiers type
        {
            mods = $variableModifiers.tree;
            type = $type.tree;
        }
    va1=variableDeclarator
        {
            JCExpression ntype = pu.makeTypeArray(type,$va1.i, $va1.dimPosition, $va1.endPosition);
            JCStatement ntype1 = T.at($va1.pos).VarDef(mods, $va1.name, ntype, $va1.tree);
            //pu.storeEnd(ntype1, $va1.stop);
            ptree = ntype1;
            listBuffer.append(ntype1);
        }
    (cm=',' va2=variableDeclarator
        {
            JCExpression ntype = pu.makeTypeArray(type,$va2.i, $va2.dimPosition, $va2.endPosition);
            JCStatement ntype1 = T.at($va2.pos).VarDef(mods, $va2.name, ntype, $va2.tree);
            pu.storeEnd(ptree, $cm);
            ptree = ntype1;
            listBuffer.append(ntype1);
        }
    )*
;

I changed it to:

localVariableDeclaration returns [com.sun.tools.javac.util.List list]
    @init {
        [...]
    }
    @after {
        [...]
    }
    : variableModifiers
        {
            mods = $variableModifiers.tree;
        }
    (VAR |
    type
        {
            type = $type.tree;
        }
    )

    va1=variableDeclarator
        {
            JCExpression ntype = null;
            if( type != null )
                ntype = pu.makeTypeArray(type,$va1.i, $va1.dimPosition, $va1.endPosition);
            JCStatement ntype1 = T.at($va1.pos).VarDef(mods, $va1.name, ntype, $va1.tree);
            //pu.storeEnd(ntype1, $va1.stop);
            ptree = ntype1;
            listBuffer.append(ntype1);
        }
    (cm=',' va2=variableDeclarator
        {
            JCExpression ntype = null;
            if( type != null )
                ntype = pu.makeTypeArray(type,$va2.i, $va2.dimPosition, $va2.endPosition);
            JCStatement ntype1 = T.at($va2.pos).VarDef(mods, $va2.name, ntype, $va2.tree);
            pu.storeEnd(ptree, $cm);
            ptree = ntype1;
            listBuffer.append(ntype1);
        }
    )*
;

The change makes a local variable declaration accept (VAR | type) instead of just type. Later in the grammar, the VAR token is defined as VAR : 'var'; and a lookahead rule also needs to be adapted[[:trial and error ftw…]]:

localVariableHeader
    : variableModifiers (type|VAR) IDENTIFIER ('['']')* ('='|','|';')
;
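
To see why the lookahead rule needs the VAR alternative, here is a toy model of that rule in plain Java. It is only a sketch under heavy simplifications that are my assumptions, not how the generated ANTLR parser works: tokens are plain strings, variableModifiers are ignored, and a type is a single identifier. Once var is a keyword token, it no longer counts as an identifier, so the header only matches if the rule explicitly allows it.

```java
import java.util.Arrays;
import java.util.List;

// Toy model of the localVariableHeader lookahead; all names here are hypothetical.
class LocalVariableHeaderSketch {
    // After adding the VAR token, "var" is a keyword and no longer an identifier.
    static boolean isIdentifier(String token) {
        return !token.equals("var") && token.matches("[A-Za-z_$][A-Za-z0-9_$]*");
    }

    // Mirrors: (type|VAR) IDENTIFIER ('[' ']')* ('='|','|';')
    // allowVar == false models the rule before it was adapted.
    static boolean looksLikeLocalVariable(List<String> tokens, boolean allowVar) {
        int i = 0;
        if (i >= tokens.size()) return false;
        String first = tokens.get(i);
        boolean typeOrVar = isIdentifier(first) || (allowVar && first.equals("var"));
        if (!typeOrVar) return false;
        i++;
        // IDENTIFIER: the variable name
        if (i >= tokens.size() || !isIdentifier(tokens.get(i))) return false;
        i++;
        // ('[' ']')*: optional array brackets after the name
        while (i + 1 < tokens.size()
                && tokens.get(i).equals("[") && tokens.get(i + 1).equals("]")) {
            i += 2;
        }
        // ('='|','|';'): what has to follow for this to be a declaration
        if (i >= tokens.size()) return false;
        String next = tokens.get(i);
        return next.equals("=") || next.equals(",") || next.equals(";");
    }

    public static void main(String[] args) {
        List<String> decl = Arrays.asList("var", "t", "=", "Math");
        System.out.println(looksLikeLocalVariable(decl, true));
        System.out.println(looksLikeLocalVariable(decl, false));
    }
}
```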

The grammar doesn’t enforce that a var variable has an initializer; that check is done later in the actual implementation. It would be easy to add a flag to variableDeclarator, but that would require even more changes to the grammar file.

Now we only need to run ANTLR and regenerate the parser and lexer from the grammar and we’re done with this part.

The main change is in visitVarDef in MemberEnter (which completes the Enter stage - see http://openjdk.java.net/groups/compiler/doc/compilation-overview/index.html for more info):

public void visitVarDef(JCVariableDecl tree) {
    Env localEnv = env;
    if ((tree.mods.flags & STATIC) != 0 ||
            (env.info.scope.owner.flags() & INTERFACE) != 0) {
        localEnv = env.dup(tree, env.info.dup());
        localEnv.info.staticLevel++;
    }
    // old: attr.attribType(tree.vartype, localEnv);
    // BlackHC: deduce the type from the initializer if we have a variant
    if( tree.vartype == null ) {
        if( tree.init != null ) {
            tree.vartype = make.Type(attr.attribExpr(tree.init, localEnv));
            tree.vartype.type = tree.init.type;
        }
        else {
            log.error(tree.pos, "initializer.required.for.implicit.type");
            return;
        }
    }
    else {
        attr.attribType(tree.vartype, localEnv);
    }

    Scope enclScope = enter.enterScope(env);
    VarSymbol v =
            new VarSymbol(0, tree.name, tree.vartype.type, enclScope.owner);
    v.flags_field = chk.checkFlags(tree.pos(), tree.mods.flags, v, tree);
    tree.sym = v;
    if (tree.init != null) {
        v.flags_field |= HASINIT;
        if ((v.flags_field & FINAL) != 0 && tree.init.getTag() != JCTree.NEWCLASS) {
            Env initEnv = getInitEnv(tree, env);
            initEnv.info.enclVar = v;
            v.setLazyConstValue(initEnv(tree, initEnv), log, attr, tree.init);
        }
    }
    if (chk.checkUnique(tree.pos(), v, enclScope)) {
        chk.checkTransparentVar(tree.pos(), v, enclScope);
        enclScope.enter(v);
    }
    annotateLater(tree.mods.annotations, localEnv, v);
    v.pos = tree.pos;
}

The code simply attributes the initializer expression early and sets the variable’s type to the resulting type.
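
Stripped of the javac machinery, the decision made in visitVarDef can be sketched as follows. This is a toy model with hypothetical names in which types are plain strings; the real code works on JCTree nodes and Type objects and reports the error through the compiler’s log instead of throwing.

```java
// Toy model of the type decision in visitVarDef; not javac code.
class VarTypeDeductionSketch {
    // declaredType == null models a 'var' declaration (the grammar action left the type empty).
    // initializerType == null models a declaration without an initializer.
    static String resolveType(String declaredType, String initializerType) {
        if (declaredType != null) {
            return declaredType; // explicit type: nothing to deduce
        }
        if (initializerType == null) {
            // corresponds to the initializer.required.for.implicit.type error
            throw new IllegalStateException("initializer required for implicitly-typed variables");
        }
        return initializerType; // 'var': take over the type of the initializer expression
    }

    public static void main(String[] args) {
        System.out.println(resolveType(null, "double")); // var t = Math.atan(1);
        System.out.println(resolveType("int", "int"));   // int n = 1;
    }
}
```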

Because of this, Attr’s visitVarDef needs to be adapted to avoid recreating the type later. This requires a hack, but since it’s prototype code, that’s not a big issue (in line 735):

// BlackHC: this if condition is a hack to keep anonymous objects, etc. from breaking >_<
if( tree.init.type != tree.vartype.type )
    attribExpr(tree.init, initEnv, v.type);

Now only one additional line needs to be added to res/compiler.properties, containing the error message text that should appear if the initializer is missing, and we’re done (in line 475):

compiler.err.initializer.required.for.implicit.type=initializer required for implicitly-typed variables

I also added a line to com.sun.tools.javac.main.Main’s compile function to display a custom string to make sure that the correct compiler is run, but that’s just cosmetic.

With this, a new keyword has been added to the Java compiler by changing only a few lines. The compiler itself is not that straightforward to understand if you’re not used to its design, but it’s still amazing how easy this was. It took me about 15 hours at most to implement this feature, and 80% of that time was spent looking through the code and grammar to work out how best to add and implement the keyword.

If you want to test your compiler, you need something like the following command line (on Windows)[[:or you can configure Eclipse accordingly..]]:

java -cp MyJavaC\bin;antlrworks-1.2.3.jar com.sun.tools.javac.Main Test\Test.java

### Anonymous Objects

This feature was even simpler to implement and did not require any changes to the compiler code at all, only to the grammar. (Note that var is only allowed for local variables so far, but that is just because we only changed the localVariableDeclaration rule.) Adding anonymous objects (i.e. new { fieldName = initializer [, ...] }) is straightforward once you have automatic type deduction: if you think about it, it’s nothing but a rewrite of new Object() { public var fieldName = initializer; [...] }. ANTLR shows its strength here:

creator returns [JCExpression tree]
        @init {
            [...]
        }
    : 'new' nonWildcardTypeArguments cr1=classOrInterfaceType cl1=classCreatorRest
        {
            [...]
        }
    | 'new' cr2=classOrInterfaceType cl2=classCreatorRest
        {
            createdName = $cr2.tree;
            args = $cl2.list;
            body = $cl2.tree;
            $tree = T.at(pos).NewClass(null, typeArgs, createdName, args, body);
            pu.storeEnd($tree, $cl2.stop);
        }
    // BlackHC: add C# anonymous types
    | 'new' '{' typebody=anonymousTypeBody b2='}'
        {
            createdName = T.at(pos).Ident(names.fromString("Object"));
            $tree = T.at(pos).NewClass(null, typeArgs, createdName, args, $typebody.tree);
            pu.storeEnd($tree, $b2);
        }
    | arrayCreator
        {
            $tree = $arrayCreator.tree;
        }
;

anonymousTypeBody returns [JCClassDecl tree]
        @init {
            ListBuffer<JCTree> defs = new ListBuffer<JCTree>();
            JCTree ptree = null;
            String dc = ((AntlrJavacToken) $start).docComment;
        }
        @after {
            JCModifiers mods = T.at(Position.NOPOS).Modifiers(0);
            $tree = T.at(((AntlrJavacToken) $start).getStartIndex()).AnonymousClassDef(mods, defs.toList());
            if (ptree != null) {
                pu.storeEnd(ptree, $stop);
            }
        }
    : (va1=variableDeclarator
        {
            JCVariableDecl tree = T.at($va1.pos).VarDef(T.at(Position.NOPOS).Modifiers(Flags.PUBLIC | Flags.FINAL), $va1.name, null, $va1.tree);
            pu.attach(tree, dc);
            ptree = tree;
            defs.append(tree);
        }
    (cm=',' va2=variableDeclarator
        {
            JCVariableDecl tree = T.at($va2.pos).VarDef(T.at(Position.NOPOS).Modifiers(Flags.PUBLIC | Flags.FINAL), $va2.name, null, $va2.tree);
            pu.storeEnd(ptree, $cm);
            ptree = tree;
            pu.attach(tree, dc);
            defs.append(tree);
        }
    )*)?
;

This is all that is needed. The code is mostly copied and pasted from other rules (classOrInterfaceType and classCreatorRest), and it wasn’t really that difficult. ANTLR’s long compile times were the only obstacle when writing it.
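
The rewrite that the grammar performs can be illustrated in plain Java. The sketch below writes out by hand what new { Amount = 108, message = "hello" } from the test program roughly turns into: an anonymous subclass of Object with public final fields, matching the Flags.PUBLIC | Flags.FINAL modifiers in the rule. The field types are spelled out here because a stock compiler cannot deduce them, and the class and method names are made up for illustration.

```java
// Hypothetical demo class; it compiles with a stock javac.
class AnonymousObjectDemo {
    static int amount() {
        // Extended syntax:  new { Amount = 108, message = "hello" }
        return new Object() {
            public final int Amount = 108;
            public final String message = "hello";
        }.Amount; // works because the expression itself has the anonymous class type
    }

    static String message() {
        return new Object() {
            public final int Amount = 108;
            public final String message = "hello";
        }.message;
    }

    public static void main(String[] args) {
        System.out.println(amount());
        System.out.println(message());
    }
}
```

Storing such an object in a variable declared as Object would hide the fields, which is why the feature only becomes useful together with var: the variable then keeps the anonymous type.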

### Try it

I’ve uploaded my current sources (ready to compile and run), and you can download the zip here. Just execute compileAndRun.bat, and both my javac and the test should be compiled and run.

### What Else?

That’s it for today, but let me share a few final thoughts:

  • Hacking away on compiler code and grammar files is a lot of fun.
  • It’s not feasible for real projects, because you don’t want to start questioning the validity of the compiler you’re using (I had that with QuakeC and it wasn’t fun at all), and chasing compiler bugs is terrible in general when you want to spend your time working on your project’s actual code.
  • I have started working on a preprocessor that reads Java code with the extended syntax and emits normal 1.6 Java code. If you want to do something like this, you can probably find a grammar for your language on ANTLR’s homepage. For Java it is: http://openjdk.java.net/projects/compiler-grammar/antlrworks/Java.g.
  • This approach is more difficult though because you suddenly lose the nice functionality that gives you an expression’s type for free (which is a non-trivial thing to code on your own if you consider imports and local classes, etc.).
  • Still, it’s the best thing to do if you want to extend the language, and it’s worth considering if you have a project whose code could greatly benefit from some additional language features that can easily be emulated using normal code.
  • ANTLR is nice for quick prototyping even though the hand-written Java parser and lexer are faster in general.
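
To illustrate the preprocessor idea and its main difficulty, here is a deliberately naive sketch. It is not the preprocessor mentioned above; it is a hypothetical regex-based rewrite that can only guess the type of a var declaration when the initializer is an int, double or String literal. Anything else, such as Math.atan(1), is left untouched, because deciding its type needs exactly the type attribution the compiler gives you for free.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy source-to-source rewrite; all names are hypothetical.
class VarPreprocessorSketch {
    // Matches: var <name> = <initializer>;  (breaks on initializers containing ';')
    private static final Pattern VAR_DECL =
            Pattern.compile("\\bvar\\s+(\\w+)\\s*=\\s*([^;]+);");

    // Guesses a type from the initializer text alone; returns null if it cannot.
    static String guessType(String init) {
        if (init.matches("-?\\d+")) return "int";
        if (init.matches("-?\\d+\\.\\d+")) return "double";
        if (init.matches("\"[^\"]*\"")) return "String";
        return null; // method calls, fields, etc. need real type attribution
    }

    static String rewrite(String source) {
        Matcher m = VAR_DECL.matcher(source);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String init = m.group(2).trim();
            String type = guessType(init);
            String replacement = (type == null)
                    ? m.group() // unknown type: leave the declaration as it is
                    : type + " " + m.group(1) + " = " + init + ";";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(rewrite("var n = 108;"));
        System.out.println(rewrite("var t = Math.atan(1);"));
    }
}
```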