A BNF Grammar Converter

Aarne Ranta and Markus Forsberg, February 1 - March 22, 2002.

Versions: 22/3 improve lexer. 8/3 remove some obsolete code, improve pretty-printer, add examples. 21/2 add comment pragma. 20/2 add comments to BNF files. 17/2 generate a pretty-printer (experimental). 14/2 generate a case skeleton. 12/2 add grammar checking, simplify coercion and list conventions. 9/2 add convention about non-parsing rules, filter out non-existing literal types from lexer. 8/2 generate Latex files. 5/2 generate Alex files, indicate line numbers at parse errors. 4/2 add support for lists, extend the token type. 1/2/2002 first version.

What it does

This is a program that reads a Labelled BNF grammar file Foo.cf, such as this example, and produces six things:

the file AbsFoo.hs, an abstract syntax specification as a Haskell module (example),
the file SkelFoo.hs, a case skeleton for the abstract syntax as a Haskell module (example).
the file LexFoo.x, an Alex lexer generator file (example),
the file ParFoo.y, a Happy parser generator file (example),
the file PrintFoo.y, a pretty-printer as a Haskell module (example),
the file LatexFoo.tex, a Latex file containing a readable specification of the language (example in postscript).
an interactive parser session.

The program is implemented in Haskell, and can be used from within the interpreter Hugs, or compiled; the main module is CFTop. You also need Alex (Version 1.1) and Happy (Version 1.11); older versions may work incorrectly with the BNF Converter.

Examples

Alfa, a version of constructive type theory (document).

C4, a fragment of C (see "What it does" above).

How to get it?

It's here! It's a tar ball with all source files and examples, including a C program that C4.cf manages to parse. As well as this document.

Why it is there

The purpose is to enable syntax definitions on a high level, so that there is a single source for the abstract syntax, the pretty-printer, the parser, and the lexer, as well as for the documentation of the language. In traditional compiler construction, these five things require four different sources.

Moreover, even though Happy does a great job in generating parsers LALR(1), writing a Happy parser by hand is unnecessarily low-level if all you want the parser to return is an abstract syntax tree. Only wanting this is part of the modern multi-phase compilation ideology: you should not define your compiler back-end in the semantic actions of Happy.

Of course, if you do want to use the semantic actions of Happy to do more than just construct abstract syntax trees, you will find the BNF converter too restricted.

This program is a spin-off of a much larger tool, GF, which has a grammar formalism richer than BNF and was particularly designed for natural languages. With the addition of a GF-to-Happy converter, in combination with an earlier BNF-to-GF converter, it turned out that GF is almost usable as a compiler tool, as well - but only almost, if one wants to create components that fit seamlessly into the traditional way of writing compilers in Haskell (e.g. use Haskell's lists and integers in syntax trees, and to get rid of some concrete syntax clutter). Rather that adding such ad hoc things to GF itself, we wrote this ad hoc tool where such facilities are provided by ad hoc conventions.

Converting BNF files

Start Hugs and load in CFTop. Then read in your BNF grammar file, convert it to other formats, and automatically process the resulting files.

  $ hugs
  ...

  Prelude> :l CFTop
  ...

  Main> makeAll "C4"
  34 rules accepted
  wrote file AbsC4.hs
  wrote file LexC4.x
  wrote file ParC4.y
  wrote file LatexC4.tex
  wrote file SkelC4.hs
  wrote file PrintC4.hs
  Executing alex
  Executing happy
  unused terminals: 1
  Executing latex
  Executing xdvi
  to get started, import the parser in hugs by :l ParC4

  Main> :l ParC4
  ...

  ParC4> readFile "koe2.c" >>= print . pProgr . myLexer
  Ok (Program [Decl TInt [Ident "i"],Decl TInt [Ident "k"]] 
  [SAss (Ident "i") (EInt 1),SAss (Ident "k") (EVar (Ident "i")),
  SWhile (EOp (EVar (Ident "i")) OLt (EInt 13)) (SBlock [SAss (Ident "k") 
  (EOp (EVar (Ident "k")) OTimes (EVar (Ident "i"))),SIncr (Ident "i"),
  SPrint (EVar (Ident "k"))]),SReturn (EInt 0)])

Notice that you need hugs, alex, happy, latex, and xdvi. The Alex runtime file Alex.hs must be on your Hugs path when running the BNF converter.

When run on the file Foo.cf, makeAll writes six files, AbsFoo.hs, SkelFoo.hs, PrintFoo.hs, LexFoo.x, LatexFoo.tex and ParFoo.y.

The files LexFoo.x and ParFoo.y are compiled into a production-quality lexer and parser by using alex and happy. If this goes well, it produces two huge Haskell files, LexFoo.hs and ParFoo.hs, which define a lexer and an LALR(1) parser corresponding to your BNF grammar. The file can then be compiled (or hugs-interpreted) in the usual way. The most important functions they define are

for all categories X, a parser pX :: [Tok] -> Maybe X
a lexer myLexer :: String -> [Tok]

The file SkelFoo.hs defines case skeletons for the abstract syntax in AbsFoo.hs. This file defines a function:

transX :: X -> Result

for every category X, containing a case expression with a case for every constructor of type X (in AbsFoo.hs). Redefine

type Result = ..

to match your translation needs.

The file PrintFoo.hs defines a pretty-printer class Print and instantiates it for every datatype defined in AbsFoo.hs. The top-level function is

printTree :: Print a => a -> String

which uses the lower-level methods prt and prtList of the Print class, which produce token lists that are fed into the rendering function render. The pretty-printer uses the BNF rules, trying to infer the precedence from the numbering of the categories (cf.\ the section "Coercion conventions" below). It uses ordinary parentheses () to express those precedences; to change this, change the function parenth in the printer file. If coercions are used for something else than precedence, the pretty-printer may give strange results. This reflects the fact that BNF is not powerful enough simultaneously to express abstract syntax and pretty-printing (which is why GF is there).

The documentation, describing the language defined by your BNF grammar, is produced by compiling LatexFoo.tex with latex. If everything works fine, latex outputs a file LatexFoo.dvi that can be viewed with xdvi. If you prefer postscript, convert the output with the command dvips LatexFoo.dvi -o LatexFoo.ps.

Interactive parsing session

As an alternative to converting into Alex and Happy (for instance, in the development phase, or if you don't have these tools available), you can enter an interactive parsing session:

  Main> cfTop "Foo.cf"
  27 rules

This drops you into a subshell where you can enter strings and get them parsed in the value category of the first rule of your grammar, in this case Prog.

  Prog? <koe1.c
  Program [] []

  Prog? >Exp

  Exp? 2 + 2
  Sum Two Two

  Exp? .

  Main>

The example illustrates the four available commands:

<file parses the contents of file,
>Cat changes the start category of parsing to Cat,
. (a dot) quits the parser,
any other string is parsed in the current start category.

The parser is a top-down combinator parser which loops with left-recursive grammars (such as the example). Its way of parsing is thus different from Happy's; but it has the advantage of also accepting ambiguous grammars and showing all different results. The lexer is a hand-written Haskell function.

Within Hugs, it is easy to use the Happy parser interactively, as well, since parsers are exported for every category. For example:

  ParC4> return "2*(x+3)" >>= print . pExp . myLexer
  Ok (EOp (EInt 2) OTimes (EOp (EVar (Ident "x")) OPlus (EInt 3)))

The labelled BNF format

The BNF grammars recognized by the converter consist of rules, one per line, which have the following format:

  Ident"." Ident "::=" (Ident | Quoted)*

The first identifier is a rule label, which comes out as a constructor in the generated abstract syntax. The second identifier is a category symbol, which is any identifier. Then comes the symbol ::=, and finally a list of items separated by blanks. An item is either a category symbol (nonterminal) or a quoted string (terminal). For example, the following rule defines C-like if-else statements.

  SIf. Stm ::= "if" "(" Exp ")" Stm "else" Stm

Comments in a BNF grammar are like in Haskell: --... for single-line comments and {- ... -} for possibly multiple-line comments.

 -- This is a single-line comment.
 {- This is a multiple-line
    comment. -}

Rule names and category symbols can be chosen ad libitum, with the restrictions imposed by Haskell (if you want to use the generated Happy and Haskell files), and some other conventions:

A category symbol is a nonempty sequence of letters, starting with a capital letter.
A rule label is a nonempty sequence of letters, starting with a capital letter.

The program extracts an abstract syntax from the BNF grammar by treating categories as data types and rule names as constructors. For instance, the SIf rule extracts one branch in the definition of Stm:

  data Stm = ... | SIf Exp Stm Stm | ...

Certain identifiers are treated in special ways, as explained below: Char, Double, Int, String, Ident.

Coercion conventions

Not to clutter abstract syntax with too much concrete-syntax detail, we extend the set of category symbols with numbered variants:

A category symbol can end with a digit, and is then treated as a type synonym of the corresponding non-digited symbol. E.g. Exp1, Exp8 are synonyms of Exp.

A typical use of this convention is to distinguish between different precedence classes. For instance, the plus operator can be defined to connect Exp1 and the times operator Exp2. In fact, associativity can be defined with the same mechanism, as show in the example.

To type-control the usage of coercions, we introduce the notion of the category skeleton of a BNF rule. The category skeleton is a pair

  (C, C1...Cn)

where C is the undigited left-hand-side nonterminal and C1...Cn is the sequence of undigited right-hand-side nonterminals of the rule. The category skeleton is the structure that is used for producing a constructor in the abstract syntax, with value type C and argument types C1,...,Cn.

Precedence classes usually require semantically dummy transitions, e.g. from Exp1 to Exp3 by the addition of parentheses, and in the opposite direction for free. Since we don't want to use constructors constructors for such transitions, we introduce the following convention:

A rule label can be replaced with an underscore _, which is treated as a coercion and does not add anything to the syntax tree.

A coercion rule is well-typed only if its category skeleton has the form

  (C,C).

Digited variants of categories can be used for other purposes than precedence, provided that the type-skeleton is correct. This is OK for the parser, but the pretty-printer may get confused, since it asssumes that digits are used for precedence only.

Sometimes it is necessary to use a constructor in many BNF rules, between different precedence classes. This is accepted: it is only required that all uses of the constructor have the same category skeleton.

Hidden constructors

Rules with the hash symbol # (unquoted) appearing in the beginning of the rhs are not sent to the parser generator, but only to the abstract syntax; the hash symbol is suppressed. This convention permits the inclusion of abstract syntax that is not parsable but used for internal representation, e.g. in a later type checking phase.

Predefined basic types

The BNF converter provides predefined lexers and parsers for integers, floats, characters, and quoted strings, and interprets them directly as Haskell data objects. Thus

The category names Char, Double, Int, String denote corresponding Haskell types. They should not be used as left-hand-side categories, except in coercion rules.

Moreover, there is a predefined lexer and parser for identifiers (unquoted strings that are not reserved words):

The category name Ident denotes a hard-coded identifier type and should not be used as a left-hand-side category, except in coercion rules. Neither can Ident be used as a rule label.

The abstract syntax automatically includes the datatype definition

  data Ident = Ident String

For example, the BNF rules

  EVar. Exp ::= Ident
  EInt. Exp ::= Int

generate the following abstract syntax:

  data Exp = ... | EVar Ident | EInt Int | ...

List conventions

While it is possible to define one's own monomorphic list types by introducing constructors, it is often handy to use Haskell's polymorphic lists. The BNF converter supports Haskell's lists by accepting the following category symbols and constructors:

[X], the category of lists of type X.
[] and (:), the Nil and Cons constructors, respectively.

We also make it possible for a grammar to recognize non-empty lists only:

(:[]) is a constructor for one-element lists.

The list conventions of course only produce type-correct Haskell code if the names are used in certain ways:

A category [X] may not have other constructors than [], (:), and (:[]).
The constructor [] must have category skeleton of form ([C],).
The constructor (:) must have category skeleton of form ([C], C [C]).
The constructor (:[]) must have category skeleton of form ([C], C).

In the Latex document (for stylistic reasons) and in the Happy file (for syntactic reasons), the category name [X] is replaced by ListX. This may cause clashes, if ListX is at the same time used explicitly in the grammar

The lexer

The lexer uses a token type Lexer.Tok with four kinds of tokens:

TS String: reserved words and symbols (inferred from the grammar)
TI String: positive integer literals, parsed as Int
TL String: quoted string literals, parsed as String
TC String: character literals, parsed as Char
TD String: double-precision float literals, parsed as Double
TV String: identifiers, parsed as Ident

The writer of a BNF grammar need not know about these token constructors; we have listed them to enable hand-tuning of the lexer and the parser.

The set of reserved words is the set of terminals appearing in the grammar. Those reserved words that consist of non-letter characters are treated in a different way from those that are similar to identifiers. The lexer tries to follow normal rules familiar from languages like Haskell, C, and Java, including longest match and spacing conventions.

Comments are like in Haskell: {- ... -} and --... are treated as comments in the lexers produced. To change this, you have to define a comment pragma in your BNF grammar file (currently the only pragma in the system): put anywhere in the code either of

line-tail comments:
# comment (Quoted String)
enclosed comments:
# comment (Quoted String) (Quoted String)

The strings in enclosed comments must be, besides the quotes, no longer than two character. If this is to restrictive, then edit the LexFoo.x file.

So if we would like to make our Haskell comment convention explict, we could write:

# comment "--"
# comment "{-" "-}"

Future work

Happy's left-recursive lists should be supported.

Generate XML.

Generate syntax editors?