Representing data types and their objects in XML

27 August 2004

The eXtended Markup Language, XML, is not just a document structuring language, but also a way to represent data. Having an XML representation of data objects is useful for the exchange of data because many systems can read in data written in this form.

Therefore we have extended the BNF Converter with two flags, -xml and -xmlt, which represent grammars and abstract syntax trees in XML. For a given grammar Foo.cf, the following files are generated:

Foo.dtd, Document Type Declaration
XMLFoo.hs, a printer of trees into XML objects that are valid w.r.t. Foo.dtd

The test bench TestFoo.hs then also displays the XML representation of each tree. For the time being, the flags are only available in combination of the Haskell back end.

Goals

The purpose of the XML generator of BNFC is to use XML as yet another representation of abstract syntax, in addition to Haskell's algebraic data types, Java's classes, and C's unions of structures. Since algebraic datatypes are conceptually the semantics of all these, we will briefly discuss how algebraic datatype definitions and encoded in XML.

Two kinds of XML documents are to be generated:

DTDs = Document Type Declarations, from the algebraic datatype definitions themselves
XML elements, from trees built by constructors.

Checking the validity of an element with respect to a DTD is a central notion of XML. What we want to achieve is that

validity checking = type checking

To encode a type system with a DTD makes it slightly more complicated than would be needed otherwise, but we find this goal worth pursuing. We are still left with several alternative encodings.

Alternative encodings

The following examples are used to illustrate the alternative encodings.

  data Prop = Falsum | Verum | Conj Prop Prop | Disj Prop Prop

  Disj Verum Falsum

Constructors as tags, no types

This gives nice elements, but the DTD is clumsy, since every argument type of a constructor must be represented with the disjunction of all constructors. Moreover, there is no natural "top level" tag unless the type system explicitly includes a top type with just one "wrapper" constructor.

<!ELEMENT Verum EMPTY>
<!ELEMENT Falsum EMPTY>
<!ELEMENT Conj ((Verum|Falsum|Disj|Conj),(Verum|Falsum|Disj|Conj))>
<!ELEMENT Disj ((Verum|Falsum|Disj|Conj),(Verum|Falsum|Disj|Conj))>

<Disj>
  <Verum/>
  <Falsum/>
</Disj>

tags(f xs) = 2 + sum (tags xs)

Constructors and types as tags

This gives a natural-looking DTD, but the trees become bulky.

<!ELEMENT Verum EMPTY>
<!ELEMENT Falsum EMPTY>
<!ELEMENT Conj (Prop,Prop)>
<!ELEMENT Disj (Prop,Prop)>
<!ELEMENT Prop (Verum | Falsum | Disj | Conj)>

<Prop>
  <Disj>
    <Prop>
      <Verum/>
    </Prop>
    <Prop>
      <Falsum/>
    </Prop>
  </Disj>
</Prop>

tags (f xs) = 4 + sum (tags xs)

Types as tags, constructors as attributes

The trees look deceptively natural, but this is a non-starter since validation does not guarantee type correctness! The dual approach has the same problem.

<!ELEMENT Prop (() | (Prop, Prop))>
<!ATTLIST Prop name CDATA #required>

<Prop name="Disj">
  <Prop name = "Verum" />
  <Prop name = "Falsum" />
</Prop>

tags (f xs) = 2 + sum (tags xs)

Types as tags, constructors as empty elements

This has a good balance between natural DTD and tree size, but the introduction of empty elements to encode constructors feels like a hack and certainly not very XML-like.

<!ELEMENT Prop ((Verum) | (Falsum) | (Conj,Prop,Prop) | (Disj,Prop,Prop)>
<!ELEMENT Verum EMPTY>
<!ELEMENT Falsum EMPTY>
<!ELEMENT Conj EMPTY>
<!ELEMENT Disj EMPTY>

<Prop> <Disj/>
  <Prop> <Verum/> 
  </Prop>
  <Prop> <Falsum/> 
  </Prop>
</Prop>

tags (f xs) = 3 + sum (tags xs)

Alternatives chosen

We chose to provide two encodings in the BNFC-XML generator. They can be chosen by different flags:

-xml, constructors as tags, no types
-xmlt, types as tags, constructors as empty elements

The first encoding gives nice trees for exchange, but a horrible DTD. The second encoding gives a nice DTD, but bulky trees. Both encodings support type checking by validation.

Literals and token types

For literals and token types, both encodings use empty elements with type as tag and the content as value of the attribute value:

  <Integer value="123" />
  <Ident   value="x" />
  <String  value="aatu osaa soutaa" />

The corresponding DTD needs two clauses per type Foo.

  <!ELEMENT Foo EMPTY>
  <!ATTLIST Foo value CDATA #required>