Al et WML

First, I have created some new pages about old university projects. Among them are a condensed page about light propagation volumes (which also made me update the project files to Visual Studio 2012), a page about my bachelor's thesis in mathematics (Discrete Elastic Rods), and a page about my master's thesis in computer science (Assisted Object Placement). I have not written about the latter two subjects before. Maybe I'll talk some more about them later and write a full post-mortem on them.

This is the first of a number of posts related to my master's thesis, or rather code drops from its code base. I wrote about 60k LoC in the 6 months of my master's thesis, and there are a few bits that might be useful in the future.

The first one that I want to talk about is a very simple file format I came up with. Devising new text file formats is not something that I have been very keen on lately, especially as so many already exist. However, I have found none that really fits my requirements:

  • minimal clutter (preferably indentation-based),
  • support for raw text inclusion, and
  • good C++ support.

JSON has too much clutter and doesn't support raw text. YAML, on the other hand, sounds like the perfect choice, even though it's not that easy to find a good library for it. However, when it comes to raw text, you run into the issue that tab characters are never allowed as indentation. Moreover, I was not very happy with the API choices and some bugs in the libraries that I tried to use.

So I decided to develop a very simple text-based data storage format:

WML - Whitespace Markup Language

I'll just start with an example, i.e. my readme. You can find the full readme here.

'Whitespace Markup Language':

    Aims::
        * very simple
        * no clutter while writing
        * only indentation counts
        * empty lines have no meaning
        * embedding text is easy
        * everything is a map internally


    Example:

        title       "test\t\t1"
        path        'c:\unescaped.txt'
        version     1

        content::
            unformated text

            newlines count here

        properties:
            time-changed    10:47am
            flags   archive system hidden

        streams:
            stream:
                data::
                    some data

                    this is nested too
                flags:
                    read
                    write
                    execute

            stream:
                data::
                    key names
                    dont have to
                    be unique (see stream)
                flags:
                    read
                    write:
                        users   andreas root

As you can see, it is a whitespace-based format inspired by Python: indentation conveys structure. A WML file represents a map structure in which every node has a name and any number of children. A node definition takes one of three forms:

  • the node name followed by the names of its children on the same line (these children cannot have children of their own),
  • the node name followed by one colon, which signifies that a nested definition follows on the indented lines below (similar to a nested object in JSON), or
  • the node name followed by two colons, which signifies a raw text block.

Two different kinds of strings are supported: single-quoted raw strings that do not interpret escape sequences, and double-quoted C strings that do.
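To illustrate the "everything is a map" idea, here is a minimal sketch of the data model in Python. It is not the API of the actual C++ library: the helper names `node`, `find_all` and `values` are made up for this example, and storing a raw text block as a single child that holds the text is an assumption on my part.

```python
# A node is a (name, children) pair; a value is simply a node without children.
def node(name, *children):
    return (name, list(children))

def find_all(parent, name):
    """Return every child called `name`; key names do not have to be unique."""
    return [child for child in parent[1] if child[0] == name]

def values(parent):
    """Return the names of the children, i.e. the values of an inline entry."""
    return [child[0] for child in parent[1]]

# flags   archive system hidden   (inline entry: name followed by child names)
flags = node("flags", node("archive"), node("system"), node("hidden"))

# data::                          (raw text block, assumed to be one text child)
data = node("data", node("some data\n\nthis is nested too"))

# stream:                         (nested map: children with their own children)
first = node("stream", data, flags)
second = node("stream", node("data", node("key names")))
streams = node("streams", first, second)

print(values(flags))                      # ['archive', 'system', 'hidden']
print(len(find_all(streams, "stream")))   # 2
```

All three notations end up as the same kind of node, which is why duplicate names such as the two `stream` entries are unproblematic.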

All in all, the grammar is very simple:

Grammar::
    INDENT, DEINDENT are virtual tokens that control the indentation level
    NEWLINE is a line break

    Indentation is done with tabs only at the moment.

    Here is a rough EBNF syntax for WML:

    root: map

    value: identifier | unescaped_string | escaped_string

    identifier: (!whitespace)+
    unescaped_string: '\'' (!'\'')* '\''
    escaped_string: '"' (!"\"")* '"' with support for \t, \n, \\, \', and \"

    key: value

    map: map_entry*

    map_entry: inline_entry | block_entry

    inline_entry: key value+ NEWLINE
    block_entry: key ':' ( ':' NEWLINE INDENT textblock DEINDENT | NEWLINE INDENT non-empty map DEINDENT )

Note::
    This file is itself a WML file and root["Whitespace Markup Language"]["Example"].data() is the example WML node
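To make the grammar a bit more tangible, here is a rough parser sketch in Python. It is not the C++ implementation; the names `tokens` and `parse` are hypothetical. It also makes a few assumptions that the readme does not spell out: nodes are plain `(name, children)` tuples, a raw text block is stored as a single child, blank lines at the start and end of a raw block are dropped, unknown escape sequences are kept verbatim, and a line counts as a block entry whenever it ends in a colon.

```python
import re

# One token is a 'raw string', a "C string" or a bare identifier.
TOKEN = re.compile(r"""'([^']*)'|"((?:[^"\\]|\\.)*)"|(\S+)""")
ESCAPES = {"t": "\t", "n": "\n", "\\": "\\", "'": "'", '"': '"'}

def tokens(text):
    """Split one line into tokens, resolving escapes in double quotes only."""
    out = []
    for m in TOKEN.finditer(text):
        if m.lastindex == 2:
            unescape = lambda e: ESCAPES.get(e.group(1), e.group(0))
            out.append(re.sub(r"\\(.)", unescape, m.group(2)))
        else:
            out.append(m.group(m.lastindex))
    return out

def parse(text):
    root = ("root", [])
    stack = [(-1, root)]   # open maps as (depth, node); the root sits at depth -1
    raw = None             # (depth, node, lines) while inside a raw text block

    def end_raw():
        nonlocal raw
        if raw is not None:
            _, node, lines = raw
            node[1].append(("\n".join(lines).strip("\n"), []))
            raw = None

    for line in text.split("\n"):
        blank = not line.strip()
        depth = len(line) - len(line.lstrip("\t"))   # tabs only
        if raw is not None:
            if blank or depth > raw[0]:
                raw[2].append("" if blank else line[raw[0] + 1:])
                continue
            end_raw()                                # a dedent closes the block
        if blank:
            continue                                 # empty lines have no meaning
        while stack[-1][0] >= depth:
            stack.pop()
        parent, body = stack[-1][1], line.strip()
        if body.endswith("::"):                      # key '::' -> raw text block
            node = (tokens(body[:-2])[0], [])
            raw = (depth, node, [])
        elif body.endswith(":"):                     # key ':' -> nested map
            node = (tokens(body[:-1])[0], [])
            stack.append((depth, node))
        else:                                        # key value* -> inline entry
            name, *vals = tokens(body)
            node = (name, [(v, []) for v in vals])
        parent[1].append(node)
    end_raw()
    return root

sample = r"""
title "test\t1"
path 'c:\unescaped.txt'
content::
    raw text

    newlines count here
properties:
    flags archive system hidden
    write:
        users andreas root
""".replace("    ", "\t")   # the sketch expects tab indentation

root = parse(sample)
top = dict(root[1])         # fine here because the top-level names are unique
print(top["content"][0][0])
```

The INDENT and DEINDENT tokens of the grammar never appear explicitly; the stack of open maps plays their role, which keeps the sketch short at the cost of less precise error handling.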

I've used this format for custom shaders as well as for my settings files and the declaration files of my test scenes:

boxes:
    -:
        name 'platform, brown'
        size 20 2 20
        webColor 945412
    -:
        name 'platform, green'
        size 20 2 20
        webColor 129429
    -:
        name 'platform, muddy blue'
        size 20 2 20
        webColor 6488a5
    -:
        name 'platform, light blue'
        size 20 2 20
        webColor e5eaf1

I've uploaded the current code for WML to GitHub, and you can find the code here. The API supports an overloaded index operator to access children of a node and contains both a parser and an emitter for WML.
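For the emitter half, a sketch in Python could look like the following. Again, this is not the code on GitHub: `quote` and `emit` are hypothetical names, nodes are assumed to be `(name, children)` tuples, and the rule for choosing between bare identifiers, single quotes and double quotes is my own guess at something reasonable.

```python
def quote(token):
    """Quote a token only if it cannot be written as a bare identifier."""
    bare = (token and not any(ch.isspace() for ch in token)
            and token[0] not in "'\"" and not token.endswith(":"))
    if bare:
        return token
    if not any(ch in token for ch in "'\t\n"):
        return "'" + token + "'"          # raw string, nothing to escape
    escaped = (token.replace("\\", "\\\\").replace("\t", "\\t")
                    .replace("\n", "\\n").replace('"', '\\"'))
    return '"' + escaped + '"'

def emit(node, depth=0):
    """Serialize a (name, children) node as tab-indented WML text."""
    name, children = node
    pad = "\t" * depth
    # A single multi-line leaf is written as a raw text block.
    if len(children) == 1 and not children[0][1] and "\n" in children[0][0]:
        lines = children[0][0].split("\n")
        body = "\n".join(pad + "\t" + l if l else "" for l in lines)
        return f"{pad}{quote(name)}::\n{body}\n"
    # Children without children of their own fit on one line.
    if all(not grandchildren for _, grandchildren in children):
        names = [name] + [child[0] for child in children]
        return pad + " ".join(quote(t) for t in names) + "\n"
    # Everything else becomes a nested map.
    return f"{pad}{quote(name)}:\n" + "".join(emit(c, depth + 1) for c in children)

boxes = ("boxes", [
    ("-", [
        ("name", [("platform, brown", [])]),
        ("size", [("20", []), ("2", []), ("20", [])]),
        ("webColor", [("945412", [])]),
    ]),
])
print(emit(boxes), end="")
```

Fed with the first box of the scene file above, this prints the same five lines again, just with tabs for indentation.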

Afterthoughts

First, the current API isn't brilliant. It would be nice to separate the data model from the parser and emitter by using templates and type traits to improve abstraction. I think I might go and investigate different API types in the future and see which one works best for some simple cases.

Second, it would be possible to reduce clutter even more and remove the need for single colons to denote nested maps. An even simpler format could look like this:

nodeA
    nodeB nodeC nodeD
        nodeE
        nodeF
    nodeG:
        raw data
            with fixed indentation

        would be interpreted as "raw data\n\twith fixed indentation\n..."

This would yield the following JSON-equivalent:

{ "nodeA": { "nodeB": {}, "nodeC": {}, "nodeD": {}, "nodeE": {}, "nodeF": {}, "nodeG": { "raw data...": {} } } }

This still separates raw text from normal data. A node that contains raw text can never contain other children this way. However, I cannot think of a good way to accomplish that without introducing a special character to end a raw text block.

That's it for now :)