Syntax 4.2
- Introduction
- Sample Output
- Concepts
- File Structure
- Lexer
- Production Rules
- Error Management
- Lexer Driven Parsing
Lexer Driven Parsing
Concepts
Compiler driven parsers are the standard. They get created, invoked with an input stream, which is then read one character at a time, checking for grammar compliance and generating code and other structures as a result.
Lexical driven parsers differ in the fact that the input stream is discontinuous, usually user driven. The parser gets created and then waits for an input to come as a method call, keeping state of where it has been. The parser will receive the input and check it for correctness, causing shifts and reduces with the given symbol. When a state is reached that requires a new symbol, the parser will stop, store its state, and return from the method call.
As part of keeping state, a lexical driven parser will be able to inform what symbols are possible in the next method call, thus allowing user interfaces to enable/disable symbols. The typical case that I have encountered in the past was a calculator with operators, numbers and parenthesis. The buttons get enabled only on the presence of valid symbol transitions on the current parser state. Consider the following calculator:
[1] [2] [3] [+]
[4] [5] [6] [-]
[7] [8] [9] [*]
[(] [0] [)] [/]
[ = ]
And the following grammar
%start S
%%
S : E
;
E : E '+' E
| E '-' E
| E '*' E
| E '/' E
| '(' E ')'
| num
;
%%
We can display a state where an expression has been seen as follows.
State x:
S -> E .
E -> E . + E
E -> E . - E
E -> E . * E
E -> E . / E
State y:
E -> ( E . )
E -> E . + E
E -> E . - E
E -> E . * E
E -> E . / E
The dot in the definitions denotes what would the parser had seen. State x signifies that the parser saw an expression (E), and that after that you can have the end of the expression or an operator (‘+’, ‘-‘, ‘*’, ‘/’). State x can then be seen as allowing an operator or the end of the input. State y on the other hand has been reached only when a parenthesis was used and thus a closing parenthesis is valid, but not the end of the input. Let’s show this in the calculator diagrams. Invalid tokens in a state will be shown as periods.
For state x:
. . . [+]
. . . [-]
. . . [*]
. . . [/]
[ = ]
And for state y:
. . . [+]
. . . [-]
. . . [*]
. . [)] [/]
...
As it can be seen, we can call the parser and command it to stop when a new symbol is needed (in this case user input.) Then we can check the valid tokens and enable the user interface elements as required.
Javascript Example
%{
if (!console) { // i.e. nashorn
var console = { log: print };
}
var calc = (function() {
%}
%class {
this.number = 0;
this.toString = function() {
return "number=" + this.number;
}
}
%left '+' : "plus"
, '-' : "minus"
%left '*' : "multiply"
, '/' : "divide"
%right TOK_UMINUS:"unary minus"
%token '('
, ')'
%token <number> TOK_NUMBER : "number"
%type <number> E
%name E : "expression";
%start S
%group OPS : "operator" '+', '-', '*', '/';
%%
S : E
;
E : E '+' E = $$ = $1 + $3;
| E '-' E = $$ = $1 - $3;
| E '*' E = $$ = $1 * $3;
| E '/' E = $$ = $1 / $3;
| '-' E %prec TOK_UMINUS = $$ = -$2;
| '(' E ')' = $$ = $2;
| TOK_NUMBER
;
%%
// END OF GRAMMAR
function parserError(state, token, top, message) {
console.log("An error occurred in position " + charNum + " with token " + getTokenName(token));
console.log(message);
return ERROR_RE_ATTEMPT;
}
return {
setVerbose: setVerbose,
getTokenName: getTokenName,
getTokenFullName: getTokenFullName,
getTokenIndex: getTokenIndex,
parse: function (token, value) {
var lexicalValue = new StackElement();
lexicalValue.number = value;
var numericToken;
if (typeof token == 'number') {
numericToken = token;
}
else {
numericToken = token.charCodeAt(0);
}
return parse(numericToken, lexicalValue);
},
init: init,
getValidTokens: getValidTokens,
getResult: getResult
}
})();
var getValidTokens = function() {
var tokens = calc.getValidTokens();
var alpha = [];
tokens.forEach(function(token) {
token = calc.getTokenName(token);
alpha.push(token);
})
return alpha.join(",");
}
var getResult = function() {
return calc.getResult().number;
}
var parseOne = function(token, value) {
var rc = calc.parse(token, value);
console.log("parsed " + token + " with value " + value + " with status=>" + rc);
var tokens = getValidTokens();
console.log(" >", tokens);
}
calc.init();
parseOne('(', 0);
parseOne(32769, 1); // number 1
parseOne('+', 0);
parseOne(32769, 3); // number 3
parseOne(')', 0);
parseOne('*', 0);
parseOne(32769, 4); // number 4
parseOne('/', 0);
parseOne(32769, 5); // number 5
parseOne('+', 0);
parseOne(32769, 20); // number 20
parseOne(0, 0);
console.log("Result=", getResult());
As can be seen, the syntax file looks identical to a standard parser grammar, minus the lexical analyzer. No lexer code, nor regular expressions are needed anymore.
This example uses some routines that may need an explanation.
- init() has to be called on the parser. It can be called multiple times but everytime it would reset the parser to its initial state.
- getValidTokens() obtains the possible tokens in the parser given its current state. It then converts the tokens to their token names by calc.getTokenName(), returning a single string separated by commas, ready for display.
- parseOne() is just a utility routine that calls the parser with the current token and a value. It just logs the token, value and status returned by the scanner, plus the valid tokens.
- a parse function interface routine is created that for our example converts characters to their token number and creates an object of the StackElement type:
parse: function (token, value) {
var lexicalValue = new StackElement();
lexicalValue.number = value;
var numericToken;
if (typeof token == 'number') {
numericToken = token;
}
else {
numericToken = token.charCodeAt(0);
}
return parse(numericToken, lexicalValue);
},
You can now run it as follows:
java -jar syntax-<version>.jar --language js --driver scanner Calc.sy
jjs Calc.js
Notice the –driver option given
The output of the execution gives:
parsed ( with value 0 with status=>2
> -,(,TOK_NUMBER
parsed 32769 with value 1 with status=>2
> +,-,*,/,)
parsed + with value 0 with status=>2
> -,(,TOK_NUMBER
parsed 32769 with value 3 with status=>2
> +,-,*,/,)
parsed ) with value 0 with status=>2
> EOS,+,-,*,/
parsed * with value 0 with status=>2
> -,(,TOK_NUMBER
parsed 32769 with value 4 with status=>2
> EOS,+,-,*,/
parsed / with value 0 with status=>2
> -,(,TOK_NUMBER
parsed 32769 with value 5 with status=>2
> EOS,+,-,*,/
parsed + with value 0 with status=>2
> -,(,TOK_NUMBER
parsed 32769 with value 20 with status=>2
> EOS,+,-,*,/
parsed 0 with value 0 with status=>1
> EOS,+,-,*,/
Result= 23.2