Jancy Compiler Overview
The Jancy compiler uses a lexical analyzer generated by Ragel (a universal finite state machine compiler). It is perfectly suited for building lexers due to the convenience and expressiveness of its input language and, even more importantly, the unbeatable performance of the output code.
On the backend, Jancy uses LLVM, which offers a reliable optimizer and native code generator for a wide variety of platforms. LLVM also significantly simplifies the process of ensuring ABI compatibility with C/C++.
As for the syntax analyzer we use our own generator of table-driven top-down LL(k) parsers called Graco. Read the next section for some thoughts on why we settled on this approach.
The output of the parser is LLVM IR – without the intermediate generation of an AST. Unlike with bottom-up parsers, in top-down parsers it is rather convenient to perform semantic analysis and generate IR in parallel while parsing, i.e. while the parser is matching the grammar rules.
Jancy grammar belongs to context-sensitive LL(2).
Syntactic/semantic analysis is performed in multiple passes (mostly two, with some rules requiring three passes). This, however, does not mean that the source is re-tokenized multiple times.
The second pass is needed to allow the usage of global entities (namespaces, types, variables, functions and properties) before their declaration, which might be below or even reside in a separate compilation unit. The third pass is necessary for the preliminary calculation of reactor class layouts.
Graco – LL(k) Grammar Compiler
The parser for the Jancy compiler is generated using Graco, our in-house builder of table-driven LL(k) parsers.
The main reason why we chose a generated parser while most production-level compilers favor recursive descent is this: we wanted to achieve the perfect syntax for our language. In order to succeed we had to go through a lot of trial and error experiments. We didn’t want to be bogged down in the routine of adjusting a recursive-descent parser. In addition, the generated parser provides two more benefits:
EBNF grammar as a permanently relevant syntax reference
Natural constraints to not let us get too crazy with the syntax.
What’s wrong with ANTLR, Coco, Yacc/Bison, Lemon and other undoubtedly respectable and tried-and-true parser generators is in itself an interesting subject. It is, however, the one for a separate article, as it would require detailed comparison with lots of code snippets. We will get to that some day. For now let’s just say that we needed a top-down parser generator with predictable and configurable conflict resolution mechanism and ANYTOKEN support – and we found that each existing parser generator couldn’t quite cut it at one situation or another.
The collaboration model between the parser and the lexer was “lifted” from Lemon. The Graco-generated parser does not call the lexer. Instead, the external loop asks the lexer to tokenize the next source chunk and feeds tokens into the table-driven parser. This model allows for an incremental parsing of incomplete compilation units (supports pause-and-resume, so to speak). Plus there is no need for artificial nesting level limitations due to the danger of a stack overflow (it’s simply not an issue here).