Friday, January 16, 2009

3 applications of "indexed composition" as a language design principle

Here are links to 3 toy language specs (as webapps) for manipulating and computing with records, graphs, and processes. i have included below a short exegesis of the strategy for generating these languages and some examples from each language.

Building the examples

To build these examples on a *nix box you need at a minimum
  • svn client installed
  • mvn (2.x) installed
  • for GraphL you also need graphviz installed and its dot utility on your execution path

The steps to build any one of these are the same, but for purposes of discussion let's use naturalselection

> svn co http://svn.biosimilarity.com/src/open/naturalselection/trunk naturalselection
...
> cd naturalselection
> mvn compile
> mvn jetty:run

Point your browser at http://localhost:8080 and play around.


A brief explanation of how this stuff works

In essence the idea can be explained in terms of a data structure that supports some form of composition. The simplest of these (with identity and an associative notion of composition) is a monoid. An instance of such a data type is freely generated from

m,n ::= e | g1 | ... | gN | m*n

and cut to order with the identities, e*m = m = m*e, m1*(m2*m3) = (m1*m2)*m3, together with any identities a specific monoid might require.
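To make this concrete, here is a minimal sketch in Python (not the Scala+Lift code at the links; the names are invented) of a freely generated monoid in which the identity and associativity laws hold by construction, because terms are normalized to flat words of generators:

```python
# A freely generated monoid, sketched in Python: terms are normalized
# to flat words (lists) of generators, so the identity laws
# e*m = m = m*e and associativity hold by construction.

def e():
    return []                # the identity: the empty word

def g(name):
    return [name]            # a generator, as a one-letter word

def comp(m, n):
    return m + n             # composition: concatenation of words

m = comp(g("g1"), g("g2"))
assert comp(e(), m) == m == comp(m, e())                 # e*m = m = m*e
assert comp(comp(g("g1"), g("g2")), g("g3")) == \
       comp(g("g1"), comp(g("g2"), g("g3")))             # associativity
```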

This is all well and good, but many times we want to provide a form of indexed composition. We are familiar with this in arithmetic where we want indexed sums or indexed products. These are usually written as expressions of the forms
  • ∑_{i ∈ S} t(i)
  • ∏_{i ∈ S} t(i)
However, it is clear that such expressions can really be expressed as a kind of comprehension
  • {+ t(i) | i <- S +} // we move from ∑ summation to "braced-summation"
  • {* t(i) | i <- S *} // we move from ∏ product to "braced-product"

Again, there is not much more to this than a shift in syntax. However, there is a question about the language for
  • generators, e.g. i <- S, and
  • conditions, e.g. p(i), where p is some predicate on i
Here, we can use two monads to generate the language for predicates and patterns (the source and target of substitutions). For this we only need to provide as input a 'notion of collection' and then both the predicate and pattern languages fall out.
Suppose our 'notion of collection' is set. Then the logic language for the predicates is given by

k,l ::= true | ~k | k&l | e | g1 | ... | gN | k*l | \forall x in k.l | rec x.k | x

with semantics as discussed, for example, here.

The pattern language is given by

p,q ::= _ | v | e | g1 | ... | gN | p*q

Finally, our indexed composition is given by

{* p | q1 <- m1, ..., qN <- mN, qi1 in k1, ..., qiM in kM *}

Pragmatically, a term of this form will evaluate to

(p s1) * (p s2) * ... * (p sN) * ...

where the si are substitutions generated from the generators that simultaneously satisfy the constraints. In fact, there is a nice 'evaluation' theorem which says that every term in Q m c evaluates to a (possibly lazy) extensional form. This is essentially a generalization of (the construction underlying) Codd's theorem.
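As a rough operational sketch of that evaluation (Python, with invented names; real generators and conditions would range over richer structures): substitutions are drawn from the generators, filtered by the conditions, and the instantiated patterns are composed with the monoid operation.

```python
from itertools import product

# Sketch: evaluating an indexed composition {* p | q1 <- m1, ..., conditions *}
# over the free monoid of words. A substitution binds each pattern variable
# to an element drawn from its generator; the conditions filter the
# substitutions, and the instantiated patterns are composed in order.

def indexed_comp(pattern, generators, conditions):
    result = []                                  # the monoid identity
    for subst in product(*generators):           # all substitutions s1, s2, ...
        if all(c(subst) for c in conditions):    # keep those satisfying k1..kM
            result = result + pattern(subst)     # ... * (p si) * ...
    return result

# {* [i, j] | i <- [1,2,3], j <- [1,2,3], i < j *}
out = indexed_comp(lambda s: [s[0], s[1]],
                   [[1, 2, 3], [1, 2, 3]],
                   [lambda s: s[0] < s[1]])
assert out == [1, 2, 1, 3, 2, 3]     # (1,2), (1,3), (2,3) composed in order
```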

If you recognize that a monad is really a monoid in disguise (or the other way around depending on which side of the mirror you live on), you will see that this is really just Oleg's LogicT monad transformer generalized with a 'notion of collection'.

More specifically, given a data type, like m above, that supports some composition we can generate a new data type Q m c, (where c is the 'notion of collection'). In the case that c is set, then QSet m = Q m set is given by

m,n ::= e | g1 | ... | gN | m*n | {* p | q1 <- m1, ..., qN <- mN, qi1 in k1, ..., qiM in kM *}
k,l ::= true | ~k | k&l | e | g1 | ... | gN | k*l | \forall x in k.l | rec x.k | x
p,q ::= _ | v | e | g1 | ... | gN | p*q


In the code at the links above i'm still writing the evaluators. The one for GraphL is filled in enough that the REPL doesn't just print out the parse tree for the expression but actually generates an SVG image using Graphviz magic.

The others are still in progress, so check back at the repo occasionally, as there will be updates to the source. A final comment regarding the generality of this approach. The examples were chosen for a reason.
  • In the case of naturalselection, we have essentially autogenerated a pretty compelling version of SQL.
  • In the case of graphs, not only have we given an algebra of graphs with nice properties, we have also autogenerated a query language for graphs.
  • In the case of processes you will note that processes have execution semantics. They evolve in time. The logic that is autogenerated allows constraints that govern the execution semantics of processes. The constraint solver is essentially a model-checker. There's an older codebase written in OCaml instead of Scala+Lift where the REPL and model-checker are considerably more developed, here. This codebase does not explore the idea of indexed composition, however.

The point of this is that the approach is extremely general and yet effective and realizable with tools that are freely available on the web.

Examples of the toy language naturalselection
  • The empty record and a synonym
Nil
record () { }
  • An extensionally described dataset -- this is the equivalent of a dump from an SQL table
record (F1,F2,F3){ (1,2.0,"three"), (4,5.0,"six") }
  • An intensionally described dataset -- describing a dataset in terms of a composition of some given datasets R1, R2, R3
record (F1,F2,F3) { (x,y,z) | (x,u,w) <- R1, (v,y,a) <- R2, (t,s,z) <- R3, u == v, a > w }
this returns all the triples (x,y,z) whose 1st column comes from the 1st column of R1, whose 2nd column comes from the 2nd column of R2, and whose 3rd column comes from the 3rd column of R3; the dataset contains only those triples for which the 2nd column of R1 equals the 1st column of R2 and the 3rd column of R2 is greater than the 3rd column of R1
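For readers who think in list comprehensions, the same query can be sketched in Python (the sample rows for R1, R2, R3 are invented for illustration):

```python
# The intensional record comprehension, sketched as a Python list
# comprehension. The sample rows for R1, R2, R3 are invented; columns
# are positional, as in the record syntax above.

R1 = [(1, 7, 5), (2, 8, 9)]        # rows (x, u, w)
R2 = [(7, 3.0, 8), (8, 4.0, 6)]    # rows (v, y, a)
R3 = [(0, 0, "z1")]                # rows (t, s, z)

result = [(x, y, z)
          for (x, u, w) in R1
          for (v, y, a) in R2
          for (t, s, z) in R3
          if u == v and a > w]     # the join and comparison conditions

assert result == [(1, 3.0, "z1")]  # only (1,7,5) with (7,3.0,8) passes both
```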

  • Another few examples of intensionally described datasets
let {: (a,b,c) ; R1 :} = R in record (F1,F2,F3) { (x,y,z) | (x,u,w) <- R1, (v,y,a) <- R2, (t,s,z) <- R3, u == v, a > w }

let {: (a,b,c) ; R1 :} = R in record (F1,F2,F3) { (x,y,z) | (x,u,w) <- R1, (v,y,a) <- R2, (t,s,z) <- R3, u == v, a > w, (u,v,w) in (~null,~null,~null) }

let {: (a,b,c) ; R1 :} = R in record (F1,F2,F3) { (x,y,z) | (x,u,w) <- R1, (v,y,a) <- R2, (t,s,z) <- R3, u == v, a > w, (u,v,w) in (~null,~null,~0) }

These three illustrate 'destructuring' a given dataset in a let binding. This particular binding picks off the first triple from R and uses the 'rest' as the value R1. These examples extend the previous one by adding structural logical conditions. The middle one requires that we don't pick out tuples with null values for u, v, and w. The last one requires that we don't pick out null values for u and v and that w is non-zero.
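One way to read the structural conditions, sketched in Python (the helper and predicate names are invented): a tuple pattern like (~null, ~null, ~0) acts as a tuple of componentwise predicates.

```python
# Structural conditions read as componentwise predicates on a tuple:
# (~null, ~null, ~0) requires each position to satisfy its predicate.
# The helper names are invented for this sketch.

def in_pattern(values, preds):
    return all(p(v) for p, v in zip(preds, values))

not_null = lambda v: v is not None
not_zero = lambda v: v != 0

# (u, v, w) in (~null, ~null, ~0)
assert in_pattern((1, 2, 3), (not_null, not_null, not_zero))
assert not in_pattern((1, 2, 0), (not_null, not_null, not_zero))
assert not in_pattern((None, 2, 3), (not_null, not_null, not_zero))
```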

  • Operations on dataset expressions 'in place'
(2,3,5) :: record (F1,F2,F3) { (x,y,z) | (x,u,w) <- R1, (v,y,a) <- R2, (t,s,z) <- R3, u == v, a > w }

This example 'conses' a triple onto a dataset containing triples
  • This one illustrates the creation of 'infinite' datasets using lazy semantics
let R = (( (2,3,5) :: record (F1,F2,F3){ (x*x,y*y,z*z) | (x,y,z) <- R } )) in R

The first 5 triples of this dataset are
{(2,3,5),(4,9,25),(16,81,625),(256,6561,390625),(65536,43046721,152587890625)}
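The lazy, self-referential semantics can be mimicked in Python with a corecursive generator (a sketch, not the naturalselection implementation):

```python
from itertools import islice

# The self-referential dataset as a corecursive Python generator:
# R yields (2,3,5) and then the componentwise squares of everything
# a fresh copy of R yields.

def R():
    yield (2, 3, 5)
    for (x, y, z) in R():
        yield (x * x, y * y, z * z)

first5 = list(islice(R(), 5))
assert first5 == [(2, 3, 5), (4, 9, 25), (16, 81, 625),
                  (256, 6561, 390625), (65536, 43046721, 152587890625)]
```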


Examples of the toy language GraphL
  • The discrete graph with 3 vertices
<[[1]]> | <[[2]]> | <[[3]]>
  • The graph with two vertices and an edge between them
(("phred"))(let x = [[1]] in <[[1]]>, let y = [[2]] in <[[2]]>)
  • Two copies of the previous example laid side-by-side
(("phred"))(let x = [[1]] in <[[1]]>, let y = [[2]] in <[[2]]>) | (("jorge"))(let u = [[3]] in <[[3]]>, let v = [[4]] in <[[4]]>)
  • Lest you think these examples verbose, check out the docs (with pictures) in the doc directory of the GraphL project to see examples for where the programmatic language starts to pay off; but please note the documents are slightly out of date with respect to the running application and i haven't taken the time to update them.
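For a rough feel of what such a graph algebra might look like under the hood, here is a Python sketch (this is emphatically not GraphL's implementation or syntax; the representation and names are invented): graphs as (vertices, edges) pairs, with | as disjoint union and a labeled connect operation.

```python
# An invented Python representation (not GraphL): a graph is a pair
# (vertices, edges); '|' is disjoint union of graphs; connect adds a
# labeled edge from every vertex of the first graph to every vertex
# of the second.

def vertex(v):
    return ({v}, set())

def par(g1, g2):                 # the '|' of the examples above
    return (g1[0] | g2[0], g1[1] | g2[1])

def connect(label, g1, g2):      # (("label"))(g1, g2)
    vs = g1[0] | g2[0]
    es = g1[1] | g2[1] | {(a, label, b) for a in g1[0] for b in g2[0]}
    return (vs, es)

# the discrete graph with 3 vertices
d3 = par(par(vertex(1), vertex(2)), vertex(3))
assert d3 == ({1, 2, 3}, set())

# two vertices with a "phred"-labeled edge between them
g12 = connect("phred", vertex(1), vertex(2))
assert g12 == ({1, 2}, {(1, "phred", 2)})
```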

Examples of the toy language parallelOgram

  • The stopped process
0
  • A process waiting on input at channel x, only to terminate immediately after
x?(y)0
  • The process that will deliver the code for the stopped process to some lucky process waiting on input at channel x
x!('0)
  • The process that will wait for input on channel x or channel u and then terminate
switch { x?(y)0, u?(v)0 }
  • The process that runs the process that outputs on channel x the code for the stopped process in parallel with a process waiting for input on channel x (and then terminates)
{ x?(y)0, x!('0) }
  • The classic race -- the switched process will wait for input on x or u and is running in parallel with two processes, one that outputs on x and one that outputs on u
{ u!('0), x!('0), switch { x?(y)0, u?(v)0 } }
  • This one is pretty cool -- it's a parallel version of the paradoxical combinator -- essentially it implements (a much too eager form of) Milner's replication, !P
{ x?(y){*y, x!(y)}, x!('{ P, x?(y){*y, x!(y)} }) }
  • Our first example of a comprehension in processes -- this will generate an (output on x running in parallel with the switch) running in parallel with (an output on u running in parallel with the switch)
{ {: P, 'switch { x?(y)0, u?(v)0 } :} | P <- { x!('0), u!('0) } }
  • Another example of a comprehension -- this generates a switch with branch continuations taken from the outputting processes
{ :switch: { x:?(@X)P, u:?(@Y)P } | P <- { x!('0), u!('0) } }

8 comments:

leithaus said...

One point that i didn't make clear is that indexed compositions can also be pushed to store -- they are not just in-memory constructions. This is a marked distinction from standard relational dbs.

Titto Assini said...

Sounds very interesting, but, for mathematically challenged people like me, could you expand a bit on how exactly you derive the logic and pattern language? You say:

"Suppose our 'notion of collection' is set. Then the logic language for the predicates is given by ..."

How exactly do you get that? And what if the collection were not a set? How would that change the derived language?

Thanks for a most interesting post.

leithaus said...

Titto -- thanks! i had so much else to say that it was impossible to recap the results of how to generate the logic (resp pattern) language. i did include a link to a previous discussion http://biosimilarity.blogspot.com/2007/02/algebraic-databases.html

In general this idea has a long history. Abramsky found a version which he dubbed domain theory in logical form. More recently, i realized that the essence of the construction is really determined by a pair of monads.

The first monad, let us call T, is the monad for the term language of the data structure. This is essentially a presentation of the term language as an Eilenberg-Moore algebra of the monad. Or, if you like it's some parametric data type in Haskell (subject to some constraints).

The second monad, let us call it S, is a monad for a notion of collection, like the list monad or the set monad or the sets-of-sequences monad or the tree monad or...

We know -- up front -- that the denotations of logical formulae should be collections of terms. What the construction does is to create a "relationally" complete language. That is, it constructs a logical language that expresses all of the collections you might want to express.

The construction is in essence a distributive law. That is, it is a morphism from ST to TS. It shows how to make collections of terms into terms built from collections. The core notion of collection gives the basic logical constructors. Sets, for example, give you the boolean connectives (true, negation, and conjunction). But, then, the term structure generates new logical connectives. In the case of the monoid example in the post, the monoid identity as well as the monoid generators have counterparts as formulae that denote all terms that are structurally equivalent to them. That is, all terms that can be proven equal using the relations of the monoid presentation. Further, the monoid composition also generates a logical connective that denotes a kind of term splitting -- see the link i gave above and in the blog post.
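A toy Python illustration of the shape of such a morphism, S(T a) -> T(S a), with S as 'set of' and T as a binary composition (this shows only the type; a genuine distributive law must also satisfy coherence conditions, which this sketch does not address):

```python
# Toy illustration of the shape S(T a) -> T(S a): S is 'set of'
# (frozenset), T is a binary composition (pairs). A set of composed
# terms becomes a composition of sets, one per position. This shows
# only the type of a distributive law, not its coherence laws.

def distr(set_of_pairs):
    lefts = frozenset(a for (a, b) in set_of_pairs)
    rights = frozenset(b for (a, b) in set_of_pairs)
    return (lefts, rights)

assert distr({(1, "x"), (2, "y")}) == (frozenset({1, 2}),
                                       frozenset({"x", "y"}))
```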

alexis said...

Greg,

What you have described is very interesting, but the presentation makes it difficult to see how it could be applied to an existing DSL.

It would help if you could state what conditions are necessary for the construction, and which are just used in an example. For instance you use monoids to motivate the introduction but then it later becomes clear that the necessary structure is monadic, and monoids are used to build examples.

By the same token, why are the term structures in your examples the right ones? Could the examples be even simpler? It would be useful to see some 'minimal' examples containing only what is needed, and no more. Then the machinery will be more apparent.

This could helpfully be motivated by some introductory anecdotes of the form "normally we say this, but we would like to say this, and build it this way".

Cheers,

alexis

alexis said...

Post scriptum - the link to your algebraic databases post of 2007 is helpful; IMHO it would be worth using your original question from that post as a worked example, since the term algebras are more familiar to the uninitiated. The leap to records and queries from 2007, to the ideas in this post, is all very fine. But it could be spelt out more, and the boolean operators of 2007 are a darn sight easier for the eye/brain to recognise immediately.

leithaus said...

Alexis,

Thanks for your comment. i can't make the examples any simpler than a monoid plus a 'notion of collection'. That's the minimally interesting example. See the first 5 paragraphs of the post after the heading "a brief explanation of how this stuff works".

Monadic structure is not required. However, because monads are monoid objects in the category of endofunctors -- and because all the parametrically polymorphic notions of collection are also presentable as monads -- we recover the construction as a triple of morphisms on monads. One morphism generates the logic language from the term language plus the notion of collection. One morphism generates the pattern language from the term language plus the notion of collection. One morphism generates the new data structure with the indexed compositions from the original with only extensional compositions.

Also, i should clarify, the idea is to generate a DSL from a given term language. If the term language is also a DSL, then the method as described works in a mechanical fashion.

Best wishes,

--greg

leithaus said...

Alexis,

i will agree, though, that the mathematical notation of the algebraic database post is considerably simpler to read and that the simultaneous specification of syntax and semantics is truly gorgeous -- thanks Luis Caires!

One of the several aims of the indexed compositions post was to show that this is more than just a mathematical curiosity; that we can generate some languages that are fairly convincing in their detail and scope. Another aim was to show that this work is a natural generalization of Oleg's LogicT transformer.

alexis said...

Thanks Greg, that is much clearer now!

I had thought your presentation was in effect: monad * monad = DSL. A mutual acquaintance pointed out that this makes sense since eg an interpreter or a compiler can be a monad.

I see now that your presentation is more general.

alexis