New Pragmas in the BNF Converter

By Aarne Ranta, September 19, 2003.

This document is a supplement to the BNFC documentation, intended to explain the new pragmas introduced in versions 1.3 and 1.9b. Its content will be incorporated into the BNFC report later.

New pragmas in BNFC version 1.3

These pragmas do not add to the expressive power of BNFC, but are just shorthands for groups of rules.

Terminators and separators

The terminator pragma defines a pair of list rules by specifying the token that terminates each element of the list. For instance,

  terminator Stm ";" ;

says that each statement (Stm) is terminated with a semicolon (;). It is a shorthand for the pair of rules

  [].  [Stm] ::= ;
  (:). [Stm] ::= Stm ";" [Stm] ;

The qualifier nonempty in the pragma makes the one-element list the base case. Thus

  terminator nonempty Stm ";" ;

is shorthand for

  (:[]). [Stm] ::= Stm ";" ;
  (:).   [Stm] ::= Stm ";" [Stm] ;

The terminator can be specified as the empty string "". No token is then introduced; for example,

  terminator Stm "" ;

is translated to

  [].  [Stm] ::= ;
  (:). [Stm] ::= Stm [Stm] ;
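
The three expansions above follow one pattern, which can be sketched in a few lines of Python. This is only an illustration of the translation, not BNFC's own code; the function name expand_terminator and its parameters are hypothetical.

```python
def expand_terminator(cat, token, nonempty=False):
    """Return the rules that `terminator [nonempty] cat "token" ;` abbreviates.

    Hypothetical helper for illustration; BNFC itself is not implemented this way.
    """
    # An empty terminator "" contributes no token to the right-hand side.
    tail = f' "{token}"' if token else ""
    if nonempty:
        # Base case: the one-element list.
        base = f"(:[]). [{cat}] ::= {cat}{tail} ;"
    else:
        # Base case: the empty list.
        base = f"[]. [{cat}] ::= ;"
    step = f"(:). [{cat}] ::= {cat}{tail} [{cat}] ;"
    return [base, step]


for rule in expand_terminator("Stm", ";", nonempty=True):
    print(rule)
```

The rules are printed with single spaces rather than the column alignment used in this document.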

The separator pragma is similar to terminator, except that the separating token is not attached to the last element. Thus

  separator Stm ";" ;

means

  [].    [Stm] ::= ;
  (:[]). [Stm] ::= Stm ;
  (:).   [Stm] ::= Stm ";" [Stm] ;

whereas

  separator nonempty Stm ";" ;

means

  (:[]). [Stm] ::= Stm ;
  (:).   [Stm] ::= Stm ";" [Stm] ;

Notice that, if the empty token "" is used, there is no difference between terminator and separator.

Problem. The grammar generated from a separator without nonempty also accepts a list that ends with a semicolon, whereas the pretty printer "normalizes" the trailing semicolon away. This might be considered a bug, but a set of rules forbidding the terminating semicolon would be much more complicated. The nonempty case is strict.
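
To see why the trailing semicolon slips through, one can transcribe the separator rules into a small recognizer. The Python sketch below is an illustration only: it assumes that a statement is the single token "s", and the function name accepts is hypothetical.

```python
def accepts(tokens, nonempty=False):
    """Recognize [Stm] according to the rules generated by `separator Stm ";"`.

    Assumption for illustration: a statement Stm is the single token "s".
    """
    def list_ends(i):
        # Positions at which a [Stm] that starts at position i can end.
        ends = set()
        if not nonempty:
            ends.add(i)                        # [].    [Stm] ::= ;
        if i < len(tokens) and tokens[i] == "s":
            ends.add(i + 1)                    # (:[]). [Stm] ::= Stm ;
            if i + 1 < len(tokens) and tokens[i + 1] == ";":
                ends |= list_ends(i + 2)       # (:).   [Stm] ::= Stm ";" [Stm] ;
        return ends

    return len(tokens) in list_ends(0)


print(accepts(["s", ";"]))                 # trailing semicolon, plain separator
print(accepts(["s", ";"], nonempty=True))  # trailing semicolon, nonempty variant
```

In the plain case the input s ; is derived by the (:) rule with an empty tail, so it is accepted. With nonempty the empty tail is not available, which is why that case is strict.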

Coercions

The coercions pragma is a shorthand for a group of rules translating between precedence levels. For instance,

  coercions Exp 3 ;

is shorthand for

  _. Exp  ::= Exp1 ;
  _. Exp1 ::= Exp2 ;
  _. Exp2 ::= Exp3 ;
  _. Exp3 ::= "(" Exp ")" ;

Because the coercions cover every level up to the given integer, it does not matter in practice if that integer (here 3) is greater than the highest level actually occurring in the grammar, or if some of the levels have no productions.
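
The expansion is mechanical, as the following Python sketch suggests. It is an illustration rather than BNFC's implementation, and the name expand_coercions is hypothetical.

```python
def expand_coercions(cat, highest):
    """Return the rules that `coercions cat highest ;` abbreviates.

    Hypothetical helper for illustration only.
    """
    def level(n):
        # Level 0 is written without an index: Exp, Exp1, Exp2, ...
        return cat if n == 0 else f"{cat}{n}"

    # One coercion from each level to the next higher one.
    rules = [f"_. {level(n)} ::= {level(n + 1)} ;" for n in range(highest)]
    # The highest level wraps the lowest level in parentheses.
    rules.append(f'_. {level(highest)} ::= "(" {cat} ")" ;')
    return rules


for rule in expand_coercions("Exp", 3):
    print(rule)
```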

Rules

The rules pragma is a shorthand for a set of rules from which labels are generated automatically. For instance,

  rules Type ::= "int" | "float" | "double" | "long" ;

is shorthand for

  Type_int.    Type ::= "int" ;
  Type_float.  Type ::= "float" ; 
  Type_double. Type ::= "double" ; 
  Type_long.   Type ::= "long" ;

The labels are created automatically. If a production consists of a single item, the label is built from the type name and that item, as in Type_int above. If it is longer, the type name indexed with an integer is used. No global checks are performed when generating these labels; any name clashes that result are caught by BNFC's type checking of the generated rules.
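
A rough Python sketch of the label generation follows. The single-item case mirrors the Type example above; the exact numbering scheme for longer productions is not spelled out here, so using the 1-based position of the alternative is an assumption, as is the helper name expand_rules.

```python
def expand_rules(cat, alternatives):
    """Expand `rules cat ::= alt1 | alt2 | ... ;` into labelled rules.

    Each alternative is a list of items as written in the grammar,
    e.g. ['"int"'] or ['Type', '"*"'].
    Assumption: longer productions are indexed by their 1-based position.
    """
    rules = []
    for index, items in enumerate(alternatives, start=1):
        # A single item that looks like an identifier gives a natural label.
        word = items[0].strip('"') if len(items) == 1 else ""
        if word.isidentifier():
            label = f"{cat}_{word}"
        else:
            label = f"{cat}{index}"
        rules.append(f"{label}. {cat} ::= {' '.join(items)} ;")
    return rules


for rule in expand_rules("Type", [['"int"'], ['"float"'], ['Type', '"*"']]):
    print(rule)
```

As in BNFC, the sketch makes no attempt to detect clashes between the generated labels.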

New pragmas in BNFC version 1.9b: layout syntax

Those who do not know what layout syntax is or who do not like it can skip this section.

These new pragmas define a layout syntax for a language. Before these pragmas were added, layout syntax was not definable in BNFC. The layout pragmas are only available for the files generated for Haskell-related tools; if Java or C++ programmers want to handle layout, they can use the Haskell layout resolver as a preprocessor to their front end, before the lexer. In Haskell, the layout resolver appears, automatically, in its most natural place, which is between the lexer and the parser. The layout pragmas of BNFC are not powerful enough to handle the full layout rule of Haskell 98, but they suffice for the "regular" cases.

Here is an example, found in the grammar layout/Alfa2.cf.

  layout "of", "let", "where", "sig", "struct" ;

This pragma says that "of", "let", "where", "sig", "struct" are layout words, i.e. words that start a layout list. A layout list is a list of expressions normally enclosed in curly brackets and separated by semicolons, as shown by the Alfa example

  ECase. Exp ::= "case" Exp "of" "{" [Branch] "}" ;

  separator Branch ";" ;

When the layout resolver finds the token of in the code (i.e. in the sequence of its lexical tokens), it checks if the next token is an opening curly bracket. If it is, nothing special is done until a layout word is encountered again. The parser will expect the semicolons and the closing bracket to appear as usual.

If, however, the token t following of is not an opening curly bracket, a bracket is inserted, and the start column of t is remembered as the position at which the elements of the layout list must begin. Semicolons are inserted before later elements that begin at that position. When a token is eventually encountered left of the position of t (or an end-of-file), a closing bracket is inserted at that point.

Nested layout blocks are allowed, which means that the layout resolver maintains a stack of positions. Pushing a position on the stack corresponds to inserting a left bracket, and popping from the stack corresponds to inserting a right bracket.
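
The stack discipline can be sketched as follows. This is a simplified Python illustration, not the Haskell resolver that BNFC generates: tokens are split on whitespace only, the names tokenize and resolve are hypothetical, and an explicit opening bracket merely suppresses the implicit block without being tracked any further.

```python
import re

LAYOUT_WORDS = {"of", "let", "where", "sig", "struct"}


def tokenize(source):
    """Split on whitespace, recording (line, column, text) with 1-based columns."""
    tokens = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for match in re.finditer(r"\S+", line):
            tokens.append((line_no, match.start() + 1, match.group()))
    return tokens


def resolve(tokens):
    """Insert { ; } according to a simplified version of the layout rule."""
    out = []
    stack = []         # start columns of the open implicit layout lists
    pending = False    # True if the previous token was a layout word
    prev_line = None
    for line, col, text in tokens:
        if pending and text != "{":
            # No explicit bracket follows the layout word: open a block
            # and remember the column of its first token.
            out.append("{")
            stack.append(col)
        elif not pending and line != prev_line:
            # First token of a new line: close every block it lies left of ...
            while stack and col < stack[-1]:
                out.append("}")
                stack.pop()
            # ... and start a new element if it sits at the block position.
            if stack and col == stack[-1]:
                out.append(";")
        out.append(text)
        pending = text in LAYOUT_WORDS
        prev_line = line
    # End of file closes all blocks that are still open.
    out.extend("}" for _ in stack)
    return out


source = """case x of
  A -> case y of
    B -> b
    C -> c
  D -> d
"""
print(" ".join(resolve(tokenize(source))))
```

For this input the inner block is closed and a semicolon is inserted before D, giving case x of { A -> case y of { B -> b ; C -> c } ; D -> d }.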

Here is an example of an Alfa source file using layout:

  c :: Nat = case x of 
    True -> b
    False -> case y of
      False -> b
    Neither -> d

  d = case x of True -> case y of False -> g
                                  x -> b
                y -> h

Here is what it looks like after layout resolution:

   c :: Nat = case x of {
    True -> b
    ;False -> case y of {
      False -> b
    };Neither -> d
  
  };d = case x of {True -> case y of {False -> g
                                  ;x -> b
                };y -> h} ;

Hint. It is good practice to start a new line after any layout word, to guarantee alpha convertibility. For instance, if you change the variable name x to foo, the second definition above becomes syntactically incorrect, because the tokens after the layout words on its first line shift to the right while the following lines keep their columns. The first definition remains correct.

There are two more layout-related pragmas. The layout stop pragma, as in

  layout stop "in" ;

tells the resolver that the layout list can be exited with some stop words, like in, which exits a let list. It is not an error in the resolver to exit some other kind of layout list with in, but an error will show up in the parser.
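
The effect of a stop word can be illustrated with a deliberately reduced Python sketch that works on a single line and ignores columns altogether. The name resolve_line is hypothetical, and the sketch is not the generated resolver.

```python
LAYOUT_WORDS = {"of", "let", "where", "sig", "struct"}
STOP_WORDS = {"in"}


def resolve_line(words):
    """Insert brackets on one line, closing the innermost block at a stop word.

    Simplified for illustration: columns and line breaks are ignored.
    """
    out = []
    depth = 0          # number of implicit blocks currently open
    pending = False    # True if the previous word was a layout word
    for word in words:
        if pending and word != "{":
            out.append("{")
            depth += 1
        if word in STOP_WORDS and depth > 0:
            # The stop word exits the innermost block, whatever opened it.
            out.append("}")
            depth -= 1
        out.append(word)
        pending = word in LAYOUT_WORDS
    out.extend(["}"] * depth)
    return out


print(" ".join(resolve_line("let x = 1 in x".split())))
print(" ".join(resolve_line("case x of A -> a in b".split())))
```

The second call shows the point made above: the sketch happily closes a case block at in, and it is left to the parser to reject the result.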

The layout toplevel pragma says that the whole source file is a layout list, even though no layout word indicates this. The position is the first column, and the resolver adds a semicolon after every paragraph whose first token is at this position. No curly brackets are added. The Alfa file above is an example of this, with two such semicolons added.
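
Taken in isolation, the top-level rule amounts to the following Python sketch. It assumes tokens given as (line, column, text) triples with 1-based columns and does not interact with the bracket insertion for layout words; the function name is hypothetical.

```python
def add_toplevel_semicolons(tokens):
    """Add a semicolon after every paragraph that starts in the first column.

    Tokens are (line, column, text) triples with 1-based columns.
    Simplified for illustration: no layout words are handled here.
    """
    out = []
    prev_line = None
    for line, col, text in tokens:
        # A token in column 1 on a new line starts the next paragraph,
        # so the previous paragraph is terminated first.
        if out and col == 1 and line != prev_line:
            out.append(";")
        out.append(text)
        prev_line = line
    if out:
        out.append(";")    # terminate the last paragraph
    return out


tokens = [
    (1, 1, "c"), (1, 3, "="), (1, 5, "a"),
    (2, 3, "b"),
    (4, 1, "d"), (4, 3, "="), (4, 5, "e"),
]
print(" ".join(add_toplevel_semicolons(tokens)))
```

With two paragraphs in the input, two semicolons are added and no brackets.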

To make layout resolution a stand-alone program, e.g. to serve as a preprocessor, the programmer can modify the file layout/ResolveLayoutAlfa.hs as indicated in the file, and either compile it or run it in the Hugs interpreter by

  runhugs ResolveLayoutX.hs <X-source-file>

We may add the generation of ResolveLayoutX.hs to a later version of BNFC.

Bug. The generated layout resolver does not work correctly if a layout word is the first token on a line.