Maximizing Composability and Relax NG Trivia
Here's some interesting stuff about the design and development of Relax NG:
James Clark wrote about maximizing composability:
First, a little digression. In general, I have made it a design principle in TREX to maximize "composability". It's a little bit hard to describe. The idea is that a language provides a number of different kinds of atomic thing, and a number different ways to compose new things out of other things. Maximizing composability means minimizing restrictions on which ways to compose things can be applied to which kinds of thing. Maximizing composability tends to improve the ratio between functionality on the one hand and simplicity/ease of use/ease of learning on the other.
Clark describes the derivative algorithm's lazy approach to automaton construction:
I don't agree that <interleave> makes automation-based implementations impossible; it just means you have to construct automatons lazily. (In fact, you can view the "derivative"-based approach in JTREX as lazily constructing a kind of automaton where states are represented by a canonical representative of the patterns that match the remaining input.)
The Relax NG derivative algorithm is implemented in a few hundred elegent declarative functional lines of Haskel, and also in tens of thousands of lines and hundreds of classes of highly abstract complex Java code.
Clark's Java implementation of Relax NG is called "jing", which is a Thai word meaning truthful, real, serious, no-nonsense, and ending with "ng".
Comparing the Java and Haskell implementations of Relax NG illustrates what a wicked cool and powerful language Haskell really is. The Java code must explicitly model and simulate many Haskel features like first order functions, memoization, pattern matching, partial evaluation, lazy evaluation, declarative programming, and functional programming. That requires many abstract interfaces,, concrete classes and brittle lines of code.
While the Java code is quite brittle and verbose, the Haskell code is extremely flexible and concise. Haskell is an excellent design language, a vehicle for exploring complex problem spaces, designing and testing ingenious solutions, performing practical experiments, weighing trade-offs, and writing succinct, elegant, mathematically rigorous specifications that actually work. Haskell code is useful as a blueprint for implementations in less luxurious languages like Java.
Murata Makoto writes about the differences of implementations of TREX and Relax:
Somebody asked differences between RELAX Core V1 implementations and TREX implementations.
RELAX Core V1 has three implementations of verifiers. One (VBRELAX) is similar to PyTREX; it backtracks. The other two implementations (RELAX Verifier for C++ and RELAX Verifier for Java) are based on non-deterministic bottom-up tree automata; for each element, multiple states (or non-terminals or labels) are assigned.
In my understanding, JamesC's implementation of TREX also uses non-deterministic bottom-up tree automata. However, James constructs tree automata lazily. Consequently, his implementation is extremely fast when your schema is large but your instance is small.
On the other hand, some constructs of TREX make it hard to actually create tree automata in advance. I am concerned, since creation of tree automata is required for type inference (note: XML Query people plan to incorporate type inference into their query language). I am hoping that introduction of reasonable restrictions will make it possible to construct tree automata in advance.
Relax NG is the result of unifying TREX and Relax:
JJC reported that he had met with Murata-san (the designer of RELAX) to discuss unification of TREX and RELAX, and that their conclusion was that they felt it was probably possible to arrive at a design that was satifactory to both. There was a general feeling that unification was politically desirable. It was unanimously decided to change the charter of the TC to make the objective be the development of a unified language.
Relax NG is based on "Hedge Automata Theory" -- so what are hedges and why are they called that?
Murata Makoto wrote: I would like to talk about terminology. I have used "hedges" to refer to zero or more trees possiblly prepended, interespersed, and followed by characters. Your <a,c> is more than my hedges; it is a hedge together with a collection of attributes. You wrote "tree", but it is very misleading. We need a better term. How about "attributed hedge"?
James Clark wrote: Attributed hedge is a bit of a mouthful. How about simply forest?
Murata Makoto wrote: Derick Wood strongly believes that a forest is a set of trees rathert than a sequence of trees. In fact, many articles have used forests to mean sets of trees. How about simply hedge or orchard?
James Clark wrote: I beg to differ. I think the term forest is equally appropriate for ordered and unordered collections of trees, which is exactly what is needed for TREX since the thing we are naming is a pair of an unordered collection and an unordered collection [(sic) -- I think he means "ordered collection" -Don]. Forest have also been used for ordered collections of trees. For example, http://www.w3.org/TR/query-algebra/ uses "forest" for an ordered collection of trees, and "unordered forest" for an unordered collection of trees. I dislike "hedge" because I don't think it is closely enough related to "tree"; "orchard" is cute, but I think probably too cute.
Murata Makoto wrote: Before TATA, Gecsec and Steinby was the only book on tree automata. They used forests to mean sets of trees. Prof. Wood is very knoledgeable about formal language theory, and he strongly believes forests are sets. I think that the XML Query WG made a mistake. I e-mailed the editors. "Hedge" was coined by Bruno Coucelle. [Cou89] Bruno Courcelle. On recognizable sets and tree automata. In Maurice Nivat and Hassan Ait-Kaci, editors, Resolution of Equations in Algebraic Structures. Academic Press, 1989. As you know, I used to use forests. However, I followed Prof. Wood's advice and switched to hedges.
Here's some stuff about easy datatype assignment for TREX.
Here's a simple and interesting design for parameterized patterns, that didn't make it into TREX or Relax NG.
On balancing the simple specification with the rigorous mathematical model:
James Clark wrote: I also don't think a description in terms of automata is a good for the spec, because it would require that the spec describe how to build an automata. I also want to keep things easy for people who aren't using automata to implement this.
Kohsuke KAWAGUCHI wrote: I think the spec should describe constraints as currently it does, too. But I feel it should be based on a mathematical model, otherwise it will be unnecessarily ad hoc.
Murata Makoto describes a difference between himself and James Clark:
You tend to think that no restrictions are easier to understand for users. On the other hand, I tend to think that tight restrictions are easier to understand for users.
Why is it called Relax NG?
Name change. James - blending of Relax/TREX name hard to do, but we don't want a completely new name [because it would suggest a proliferation of schema languages which we don't want]. James - I think we should go with Relax 2.0 or some adjective [or tag], don't like what "2.0" suggests though. M.-san - I like this. It will be easier to advance JIS/ISO specs. Josh - Relax .NET? K.-san - Relax++? James - Relax NG [Next Generation]?