BNFC can now output parser files compatible with the GLR mode of the Happy parser generator for Haskell. This enables ambiguous grammars to be parsed and for all valid parses to be examined.
Further information on the GLR mode can be found in Happy's
GLR documentation, supplemented by information on
Paul Callaghan's
Happy-GLR information page.
bnfc -glr [OPTIONS] FILE.cfUsage is almost identical to normal Happy mode, except for a new flag -glr which causes a few minor changes in creation of the Happy parser file (mainly additional type annotations) and of the default test code. The latter involves changing the way the generated parser is called, and is explained later.
No other aspect of BNFC is affected, so functionality remains as expected.
For more information, refer to the documentation on the
BNFC page.
happy --glr --decode [--ghc] ParNAME.yThe optional --ghc flag turns on certain optimisations for the GHC compiler. Without this flag, the Happy-generated parsers should be compatible with Hugs in -98 mode: although large parsers may be rejected by Hugs as "too complex", and other BNFC output code may require conditional compilation (but some hand-editing may suffice here).
This invocation of happy will generate two files, called
ParNAME.hs and ParNAMEData.hs, which are the parser
driver and parser data respectively. They are in separate files because
the former needs extensive optimization, but the latter may be large
and thus too expensive to optimize. These files can be compiled with
GHC in the normal way, although sometimes you may need the
-fglasgow-exts file to permit certain instance declarations
created in the parser.
For GLR parsing, there are two possible errors: premature end of input or (global) syntax error. The former should be clear, but the second may require explanation. When ambiguity arises, GLR parsers allow several interpretations to continue in parallel. Some of these may be discarded when new input is incompatible with the interpretation. This is a syntax error for this interpretation, but the only action is to discard the interpretation. No warning or error is generated, because it doesn't necessarily mean the overall parse will fail. However, if all interpretations are dropped, then this is a global syntax error - no parses are possible.
If the parse succeeds and produces exactly one abstract syntax tree, then
this is shown. If more than one tree is produced, these are shown in turn,
numbering each tree to aid identification and to illustrate how many
different parses succeeded.
type ParseFun a = [[Token]] -> (GLRResult, GLR_Output (Err a))(Simplified) parsers obey the pattern represented by the ParseFun synonym above. The input is a list of list of tokens, where the outer list corresponds to input positions and the inner list corresponds to different interpretations of a particular input symbol. In Natural Language terms, the inner list allows representation of morphological ambiguity, for example the noun and verb senses of run.
The output is the standard parsing result GLRResult plus a more convenient form GLR_Output (Err a). The GLR_Output type allows representation of parse fail (with error message(s)), or a success (with useful results). It is polymorphic, in order to support reuse, and the occurrence of Err reflects use of that monad within BNFC.
The fail case of GLR_Output has a main error message plus a second string for supplementary information, e.g. more detailed output. The main part of the success case is a list of `semantic' results (which are abstract syntax trees in the standard BNFC setting), so this provides convenient access to the parse results. Also provided is a function of type (Forest -> Forest) -> [a], which allows modification of the parse forest to be done before the decoding to semantic results. This is useful if some pruning or reordering of results is required before the final trees are produced; an example of this is given below.
type Forest = FiniteMap ForestId [Branch] data GLR_Output a = GLR_Result { pruned_decode :: (Forest -> Forest) -> [a] , semantic_result :: [a] } | GLR_Fail { main_message :: String , extra_info :: String }Parsers exported by the GLR driver code can be converted to the simple form with the following lifting operation (the code is provided in the TestNAME.hs files).
lift_parser :: (TreeDecode a, Show a, Print a) => ([[Token]] -> GLRResult) -> ParseFun aFinally, it is necessary to create the simplified parser object before using it in your code. The following idiom is recommended, where TOPTYPE is the root type in the abstract syntax trees and TOPSYMBOL is the name of the topmost (or main) rule in the BNFC input grammar. This causes the main parser to be lifted appropriately, and via the type signature it fixes the type of objects to be produced by decoding. (This is required because of the use of type classes.)
the_parser :: ParseFun TOPTYPE the_parser = lift_parser pTOPSYMBOL
Plus . Expr ::= Expr "+" Expr Lit . Expr ::= IntegerAs an artificial example, suppose we wanted to limit the interpretations from the first rule to the ones where the left tree covers at least as much input as the right tree (which we can tell by inspecting the forest indices). This can be achieved with code like the following.
remove_light_lefts :: Forest -> Forest remove_light_lefts = mapFM foo where foo k bs = [ b | b <- bs , case b_nodes b of [_] -> True -- an Int [(a,b,_),_,(c,d,_)] -> b - a >= d - c ]In this context, mapFM applies a function of type ForestId -> [Branch] -> [Branch] to each node in the forest, where the ForestId is the node index and the second argument its current value, and the result is the new value. The code above only keeps a branch with three children whose left side is wider or the same as its right side. Note that forest indices contain start and end positions, so width is easy to calculate. You can also look at the last item of the index 3-tuples to get information about grammatical categories.
For brief reference, the relevant types are listed below. GSymbol is an enumeration of names from the grammar, each name prefixed with G_ to avoid name clashes, or it is an embedded token value. Semantic information is held in the GSem values but in general these are functions and not intended for general use (but see comment below).
type Forest = FiniteMap ForestId [Branch] type ForestId = (Int,Int,GSymbol) data Branch = Branch {b_sem :: GSem, b_nodes :: [ForestId]} deriving Show data GSymbol = {automatically generated by Happy-GLR} deriving Show data GSem = {automatically generated by Happy-GLR} deriving Show
Some final comments on deeper technical aspects of this technique: