Representing data types and their objects in XML
Aarne Ranta
27 August 2004
The eXtended Markup Language, XML, is not just a document structuring
language, but also a way to represent data. Having an XML
representation of data objects is useful for the exchange of
data because many systems can read in data written in this form.
Therefore we have extended the BNF Converter with two flags,
-xml and -xmlt, which represent grammars
and abstract syntax trees in XML. For a given grammar Foo.cf,
the following files are generated:
- Foo.dtd, Document Type Declaration
- XMLFoo.hs, a printer of trees into XML objects that are
valid w.r.t. Foo.dtd
The test bench TestFoo.hs then also displays the XML
representation of each tree.
For the time being, the flags are only available in combination of
the Haskell back end.
Goals
The purpose of the XML generator of BNFC is to use XML as yet
another representation of abstract syntax, in addition to
Haskell's algebraic data types, Java's classes, and C's
unions of structures. Since algebraic datatypes are conceptually
the semantics of all these, we will briefly discuss how algebraic
datatype definitions and encoded in XML.
Two kinds of XML documents are to be generated:
- DTDs = Document Type Declarations, from the algebraic datatype
definitions themselves
- XML elements, from trees built by constructors.
Checking the validity of an element with respect to
a DTD is a central notion of XML. What we want to achieve is that
validity checking = type checking
To encode a type system with a DTD makes it slightly more complicated
than would be needed otherwise, but we find this goal worth pursuing.
We are still left with several alternative encodings.
Alternative encodings
The following examples are used to illustrate the alternative encodings.
data Prop = Falsum | Verum | Conj Prop Prop | Disj Prop Prop
Disj Verum Falsum
Constructors as tags, no types
This gives nice elements, but the DTD is clumsy, since every argument
type of a constructor must be represented with the disjunction of all
constructors. Moreover, there is no natural "top level" tag unless
the type system explicitly includes a top type with just one "wrapper"
constructor.
<!ELEMENT Verum EMPTY>
<!ELEMENT Falsum EMPTY>
<!ELEMENT Conj ((Verum|Falsum|Disj|Conj),(Verum|Falsum|Disj|Conj))>
<!ELEMENT Disj ((Verum|Falsum|Disj|Conj),(Verum|Falsum|Disj|Conj))>
<Disj>
<Verum/>
<Falsum/>
</Disj>
tags(f xs) = 2 + sum (tags xs)
Constructors and types as tags
This gives a natural-looking DTD, but the trees become bulky.
<!ELEMENT Verum EMPTY>
<!ELEMENT Falsum EMPTY>
<!ELEMENT Conj (Prop,Prop)>
<!ELEMENT Disj (Prop,Prop)>
<!ELEMENT Prop (Verum | Falsum | Disj | Conj)>
<Prop>
<Disj>
<Prop>
<Verum/>
</Prop>
<Prop>
<Falsum/>
</Prop>
</Disj>
</Prop>
tags (f xs) = 4 + sum (tags xs)
Types as tags, constructors as attributes
The trees look deceptively natural, but this is a non-starter since
validation does not guarantee type correctness! The dual approach has
the same problem.
<!ELEMENT Prop (() | (Prop, Prop))>
<!ATTLIST Prop name CDATA #required>
<Prop name="Disj">
<Prop name = "Verum" />
<Prop name = "Falsum" />
</Prop>
tags (f xs) = 2 + sum (tags xs)
Types as tags, constructors as empty elements
This has a good balance between natural DTD and tree size, but
the introduction of empty elements to encode constructors feels
like a hack and certainly not very XML-like.
<!ELEMENT Prop ((Verum) | (Falsum) | (Conj,Prop,Prop) | (Disj,Prop,Prop)>
<!ELEMENT Verum EMPTY>
<!ELEMENT Falsum EMPTY>
<!ELEMENT Conj EMPTY>
<!ELEMENT Disj EMPTY>
<Prop> <Disj/>
<Prop> <Verum/>
</Prop>
<Prop> <Falsum/>
</Prop>
</Prop>
tags (f xs) = 3 + sum (tags xs)
Alternatives chosen
We chose to provide two encodings in the BNFC-XML generator.
They can be chosen by different flags:
- -xml, constructors as tags, no types
- -xmlt, types as tags, constructors as empty elements
The first encoding gives nice trees for exchange, but a horrible
DTD. The second encoding gives a nice DTD, but bulky trees.
Both encodings support type checking by validation.
Literals and token types
For literals and token types, both encodings use empty elements
with type as tag and the content as value of the attribute
value:
<Integer value="123" />
<Ident value="x" />
<String value="aatu osaa soutaa" />
The corresponding DTD needs two clauses per type Foo.
<!ELEMENT Foo EMPTY>
<!ATTLIST Foo value CDATA #required>