First, I have created some new pages regarding old university projects. Among them is a condensed page about light propagation volumes, which also made me update the project files to Visual Studio 2012, a page about my bachelor thesis in mathematics (Discrete Elastic Rods) and a page about my master thesis in computer science (Assisted Object Placement). I have not written about the latter two subjects before. Maybe I’ll talk some more about them later and write a full post-mortem on them.
This is the first of a number of posts that will be related to my master thesis, or rather code drops from its code base. I have written about 60k LoC in the 6 months of my master thesis and there are a few bits that might be useful in the future.
The first one that I want to talk about is a very simple file format I came up with. Devising new text file formats is not something that I have been very keen on lately, especially as so many already exist. However, I have found none that really fit my requirements:
- minimal clutter (preferably indentation-based),
- support for raw text inclusion, and
- good C++ support.
JSON has too much clutter and doesn’t support raw text. YAML, on the other hand, sounds like the perfect choice, even though it’s not that easy to find a good library for it. However, when it comes to raw text, you run into the issue that tab characters are never allowed as indentation. Moreover, I was not very happy with the API choices and some bugs in the libraries that I tried to use.
So I decided to develop a very simple text-based data storage format:
WML - Whitespace Markup Language
I’ll just start with an example, i.e. my readme. You can find the full readme here.
'Whitespace Markup Language':
	Aims::
		* very simple
		* no clutter while writing
		* only indentation counts
		* empty lines have no meaning
		* embedding text is easy
		* everything is a map internally
	Example:
		title "test\t\t1"
		path 'c:\unescaped.txt'
		version 1
		content::
			unformated text
			newlines count here
		properties:
			time-changed 10:47am
			flags archive system hidden
		streams:
			stream:
				data::
					some data
						this is nested too
				flags:
					read
					write
					execute
			stream:
				data::
					key names
					dont have to
					be unique (see stream)
				flags:
					read
					write:
						users andreas root
As you can see, it is a whitespace-based format inspired by Python. Like in Python, indentation is used to convey structure. A WML file represents a map structure: every node has a name and possibly multiple children. A node definition takes one of three forms: the node name followed by one or more child names on the same line (these children have no children of their own), the node name followed by one colon to signify that a nested definition follows (similar to JSON), or the node name followed by two colons to signify a raw text block. Two kinds of strings are supported: single-quoted raw strings that do not interpret escape sequences, and double-quoted C strings that do.
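To make the three forms concrete, here is a minimal C++ sketch that classifies a single line by its trailing colons. The names EntryKind and classifyLine are made up for this post and are not taken from the WML code base. It also ignores quoting, so treat it as an illustration rather than as what the parser does.

```cpp
#include <string>

// Hypothetical names for illustration; not part of the actual WML API.
enum class EntryKind { Inline, NestedMap, RawText };

// Classify one line by the colons at its end.
// Simplification: quoting is ignored, so an inline entry whose last
// value happens to end in ':' would be misclassified.
EntryKind classifyLine(const std::string& line) {
    // Ignore trailing spaces, tabs and carriage returns.
    const std::size_t end = line.find_last_not_of(" \t\r");
    if (end == std::string::npos) return EntryKind::Inline; // blank line, callers should skip it
    if (line[end] != ':') return EntryKind::Inline;         // "key value value ..."
    if (end > 0 && line[end - 1] == ':') return EntryKind::RawText; // "key::"
    return EntryKind::NestedMap;                             // "key:"
}
```

Note that a value such as 10:47am is not affected, because only the very end of the line is inspected.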
All in all, the grammar is very simple:
	Grammar::
		INDENT, DEINDENT are virtual tokens that control the indentation level
		NEWLINE is a line break
		Indentation is done with tabs only at the moment.
		Here is a rough EBNF syntax for WML:
		root: map
		value: identifier | unescaped_string | escaped_string
		identifier: (!whitespace)+
		unescaped_string: '\'' (!'\'')* '\''
		escaped_string: '"' (!"\"")* '"' with support for \t, \n, \\, \', and \"
		key: value
		map: map_entry*
		map_entry: inline_entry | block_entry
		inline_entry: key value+ NEWLINE
		block_entry: key ':' ( ':' NEWLINE INDENT textblock DEINDENT | NEWLINE INDENT non-empty map DEINDENT )
	Note::
		This file is itself a WML file and root["Whitespace Markup Language"]["Example"].data() is the example WML node
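The virtual INDENT and DEINDENT tokens from the grammar can be derived by counting leading tabs. The following C++ sketch shows one way to do that. Token and tokenize are hypothetical names, and the sketch assumes tab-only indentation. It also skips empty lines everywhere, which would be wrong inside a raw text block where newlines count, so a real tokenizer would need a separate mode for those blocks.

```cpp
#include <string>
#include <vector>

// Hypothetical token type for illustration only.
struct Token {
    enum Type { Indent, Deindent, Line } type;
    std::string text; // line content without leading tabs; empty for Indent/Deindent
};

// Turn raw lines into a token stream with virtual INDENT/DEINDENT tokens.
// Assumes indentation consists of tabs only.
std::vector<Token> tokenize(const std::vector<std::string>& lines) {
    std::vector<Token> tokens;
    std::size_t depth = 0; // current indentation level
    for (const std::string& line : lines) {
        std::size_t tabs = 0;
        while (tabs < line.size() && line[tabs] == '\t') ++tabs;
        if (tabs == line.size()) continue; // empty lines have no meaning
        // Emit one virtual token per level of difference.
        for (; depth < tabs; ++depth) tokens.push_back({Token::Indent, ""});
        for (; depth > tabs; --depth) tokens.push_back({Token::Deindent, ""});
        tokens.push_back({Token::Line, line.substr(tabs)});
    }
    // Close every level that is still open at the end of the input.
    for (; depth > 0; --depth) tokens.push_back({Token::Deindent, ""});
    return tokens;
}
```

A parser for the grammar above could then consume this stream without looking at whitespace again.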
I’ve used this format for custom shaders as well as for my settings files and the declaration files of my test scenes:
boxes:
	-:
		name 'platform, brown'
		size 20 2 20
		webColor 945412
	-:
		name 'platform, green'
		size 20 2 20
		webColor 129429
	-:
		name 'platform, muddy blue'
		size 20 2 20
		webColor 6488a5
	-:
		name 'platform, light blue'
		size 20 2 20
		webColor e5eaf1
I’ve uploaded the current code for WML to GitHub, and you can find the code here. The API supports an overloaded index operator to access children of a node and contains both a parser and an emitter for WML.
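To give an idea of what index-operator access on such a map structure can look like, here is a small C++ sketch. It is not the API from the repository: the Node class and its add, children and operator[] members are assumptions made up for this illustration. Children are kept in a vector because key names do not have to be unique, and the index operator simply returns the first match.

```cpp
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical node type; the real WML API may look quite different.
class Node {
public:
    explicit Node(std::string name = "") : name_(std::move(name)) {}

    const std::string& name() const { return name_; }

    // Append a child and return a reference to it; duplicate names are allowed.
    // The reference is invalidated when another child is added to the same parent.
    Node& add(std::string childName) {
        children_.emplace_back(std::move(childName));
        return children_.back();
    }

    // Return the first child with the given name, or throw if there is none.
    const Node& operator[](const std::string& childName) const {
        for (const Node& child : children_)
            if (child.name_ == childName) return child;
        throw std::out_of_range("no child named " + childName);
    }

    const std::vector<Node>& children() const { return children_; }

private:
    std::string name_;
    std::vector<Node> children_; // keeps the order from the file
};
```

With this sketch, an inline entry like size 20 2 20 becomes a node named size with three childless children, so root["size"].children().size() would be 3.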
Afterthoughts
First, the current API isn’t brilliant. It would be nice to separate the data model from the parser and emitter by using templates and type traits to improve abstraction. I think I might go and investigate different API types in the future and see which one works best for some simple cases.
Second, it would be possible to reduce clutter even more and remove the need for single colons to denote nested maps. An even simpler format could look like this:
nodeA
	nodeB nodeC nodeD
	nodeE
	nodeF
	nodeG:
		raw data
			with fixed indentation
		would be interpreted as "raw data\n\twith fixed indentation\n..."
This would yield the following JSON-equivalent:
{ "nodeA": { "nodeB": {} , "nodeC": {} , "nodeD": {}, "nodeE": {}, "nodeF": {}, "nodeG": { "raw data..." : {} } } }
This still separates raw text from normal data: a node that contains raw text can never contain other children this way. I cannot think of a good way to lift that restriction without introducing a special character to end a raw text block.
That’s it for now :)