The Holy Trinity

March 27, 2011

The Christian doctrine of trinitarianism states that there is one God that is manifest in three persons, the Father, the Son, and the Holy Spirit, who together form the Holy Trinity.   The doctrine of computational trinitarianism holds that computation manifests itself in three forms: proofs of propositions, programs of a type, and mappings between structures.  These three aspects give rise to three sects of worship: Logic, which gives primacy to proofs and propositions; Languages, which gives primacy to programs and types; Categories, which gives primacy to mappings and structures.  The central dogma of computational trinitarianism holds that Logic, Languages, and Categories are but three manifestations of one divine notion of computation.  There is no preferred route to enlightenment: each aspect provides insights that comprise the experience of computation in our lives.

Computational trinitarianism entails that any concept arising in one aspect should have meaning from the perspective of the other two.  If you arrive at an insight that has importance for logic, languages, and categories, then you may feel sure that you have elucidated an essential concept of computation—you have made an enduring scientific discovery.  Advances in our understanding of computation may arise from insights gained in many ways (any data is useful and relevant), but their essential truth does not depend on their popularity.

Logic tells us what propositions exist (what sorts of thoughts we wish to express) and what constitutes a proof (how we can communicate our thoughts to others).  Languages (in the sense of programming) tells us what types exist (what computational phenomena we wish to express) and what constitutes a program (how we can give rise to that phenomenon).  Categories tell us what structures exist (what mathematical models we have to work with) and what constitutes a mapping between them (how they relate to one another).  In this sense all three have ontological force; they codify what is, not how to describe what is already given to us.  In this sense they are foundational; if we suppose that they are merely descriptive, we would be left with the question of where these previously given concepts arise, leading us back again to foundations.  It is the foundations that I wish to describe here, because I believe it will help to clarify some common misunderstandings about the notions of proposition, type, and structure.  Of particular interest here is that a “type system” is not, under this conception, an arbitrary collection of conditions imposed on a previously given notion of program (whether written with horizontal lines, or not).  It is, rather, a way to say what the programs are in the first place, and what they mean as proofs and as mappings.

Here I will outline the basic correspondences between logic, languages, and categories by examining their structural properties (and, for now, nothing more).

The fundamental notion in logic is that of entailment, written $P_1,\dots,P_n\vdash P$, expressing derivability of $P$ from $P_1,\dots, P_n$.  This means that $P$ is derivable from the rules of logic, given the $P_i$ as axioms.  In contrast to admissibility (which I will not discuss further here) this form of entailment does not express implication!  In particular, an entailment is never vacuously true.  Entailment enjoys at least two crucial structural properties, making it a pre-order:

$\displaystyle{\strut\over{P\vdash P}}$

$\displaystyle{{P\vdash Q\quad Q\vdash R}\over{P\vdash R}}$.

$\displaystyle{{P_1,\dots,P_n\vdash Q}\over{P_1,\dots,P_n,P_{n+1}\vdash Q}}$

$\displaystyle{{P_1,\dots,P_i,P_{i+1},\dots,P_n\vdash Q}\over{P_1,\dots,P_{i+1},P_{i},\dots,P_n\vdash Q}}$

$\displaystyle{{P_1,\dots,P_i,P_i,\dots,P_n\vdash Q}\over{P_1,\dots,P_i,\dots,P_n\vdash Q}}$.

These state that “extra” axioms do not affect deduction; the “order” of axioms does not matter; “duplication” of axioms does not matter.  (These may seem inevitable, but in substructural logics any or all of these may be denied.)

In languages we have the fundamental concept of a typing judgement, written $x_1{:}A_1,\dots,x_n{:} A_n\vdash M{:}A$, stating that $M$ is an expression of type $A$ involving variables $x_i$ of type $A_i$.  A typing judgement must satisfy the following basic structural properties:

$\displaystyle{\strut\over{x:A\vdash x:A}}$

$\displaystyle{{y:B\vdash N:C \quad x:A\vdash M:B}\over{x:A\vdash [M/y]N:C}}$

We may think of the variables as names for “libraries”, in which case the first states that we may use any library we wish, and the second states closure under “linking” (as in the Unix tool ld or its relatives), with $[M/x]N$ being the result of linking $x$ in $N$ to the library $M$.  Typically we expect analogues of the “extra”, “reordering”, and “duplication” axioms to hold as well, though this ain’t necessarily so.  I will leave their formulation as an exercise for the reader.

In categories we have the fundamental concept of a mapping $f:X\longrightarrow Y$ between structures $X$ and $Y$.  The most elementary structures, perhaps, are sets, and mappings are functions, but it is more common to consider, say, that $X$ and $Y$ are topological spaces, and $f$ is a continuous function between them.  Mappings satisfy analogous structural properties:

$\displaystyle{\strut\over{\textit{id}_X : X \longrightarrow X}}$

$\displaystyle{{f:X\longrightarrow Y \quad g : Y\longrightarrow Z}\over{g\circ f:X\longrightarrow Z}}$

These express, respectively, the existence of the identity map, and the closure of maps under composition.  They correspond to reflexivity and transitivity of entailment, and to the library and linking rule of languages.  As with types, one may expect additional closure conditions corresponding to the “extra”, “reordering”, and “duplication” axioms by giving suitable meaning to multiple assumptions.  I will not go into this here, but numerous standard sources treat these conditions in detail.

What I find captivating about computational trinitarianism is that it is beautiful!  Imagine a world in which logic, programming, and mathematics are unified, in which every proof corresponds to a program, every program to a mapping, every mapping to a proof!  Imagine a world in which the code is the math, in which there is no separation between the reasoning and the execution, no difference between the language of mathematics and the language of computing.  Trinitarianism is the central organizing principle of a theory of computation that integrates, unifies, and enriches the language of logic, programming, and mathematics.  It provides a framework for discovery, as well as analysis, of computational phenomena.  An innovation in one aspect must have implications for the other; a good idea is a good idea, in whatever form it may arise.  If an idea does not make good sense logically, categorially, and typically (sorry for the neologism), then it cannot be a manifestation of the divine.

The Dog That Didn’t Bark

March 21, 2011

The crucial clue that allowed Holmes to solve the Silver Blaze mystery was the dog that didn’t bark: the apparent murderer and horse thief was not a stranger, but the trainer himself, the only one who could have gotten into the stables unnoticed.  As I’ve mentioned, this semester Dan Licata and I are co-teaching a new course in functional programming for first-year students at Carnegie Mellon.  Apart from parallelism, wbich I’ve already discussed, we are placing a strong emphasis on verification and proof as tools for the practicing programmer, chief among these being the infamous hellhounds of Computer Science, induction and recursion.  And I am happy to say that, as in the Holmes story, these are the dogs that didn’t bark.  The results from homework and an in-class exam make clear that the vast majority of the class of about 80 students have mastered these crucial techniques with a minimum of fuss and bother.  In this post I’d like to explain why I think we’ve succeeded where, apparently, so many others have failed.

One of the pleasures of teaching at Carnegie Mellon is that we the students are bright (among the brightest in the country according to the usual metrics), open-minded (they’re game for whatever I wish to throw at them), and hard-working (Carnegie’s motto is “My heart is in the work.”).  I am sure that this is a big part of the story.  Most probably our students could master almost anything, no matter how poorly we explain it!  But it can’t be just that either.  My colleagues at other top schools, both research universities and teaching colleges, public and private, American and international, have often spoken of their frustration and despair at teaching programming using recursion, let alone proof by induction, to beginning (or even advanced) students.  Some even go so far as to say that it can’t be done, that while the brightest students “get it”, the less bright students never will, so we should just give up even trying.  I’m not convinced.  I can’t prove that I’m right, but I do have quite a few years experience at pulling this off, so it is at least conceivable that the problem lies not with the students, but with the teaching methods.  So let me offer mine as an example, and perhaps others will be able to use them as well.

Let’s start with the punch line, just to pique your interest (or ire): the answer is type theory.  The great idea of constructive type theory is that there is no distinction between programs and proofs: the code embodies the reasoning the justifies it, and the reasoning is just a form of code.  This is the propositions as types principle (sometimes grandly called the “Curry-Howard isomorphism”, the small problem being that neither Curry nor Howard invented it, nor is it an isomorphism).  Theorems are types (specifications), and proofs are programs.  Programs/proofs are executed by evaluating them by applying simplification rules to reduce them to canonical form.  It’s just generalized high school algebra, in which you plug in values for variables and simplify using equational reasoning, but scaled up to a rich collection of types and programs acting on them.  In other words type theory is a language-based model of computation, which supports reasoning directly about the code you write, not its translation or interpretation in some other terms.

The significance of a language-based model for teaching cannot be overestimated.  One of the classic pitfalls of teaching recursion is to try to explain it by referring to some mysterious entity called a “stack” that is absolutely invisible in the program you write.  There’s a long song and dance about “frames” and “procedure calls” and “pushing” and “popping” things on this “stack” of which the professor speaks, and no one understands a damn thing.  It all sounds very complicated, and impressive in a way, but if there’s one thing people learn, it’s to stay the hell away from it!  And in doing so, students are robbed of one of the most powerful and useful programming tools available to them, the key to writing clean, reliable code (and, as I’ve previously argued, for parallelism as well).  The alternative is to use Plotkin’s structural operational semantics to define the behavior of all language features, including recursion.  To pull it off, you have to have a semantics for the language you’re using (something that’s in short supply for commercial languages), and you have to be facile in using it yourself.

What makes structural operational semantics so useful is that the execution model is directly defined by an inductive definition of a transition relation, $M\mapsto M'$, on programs.  (I don’t have the space to explain it fully here, but you may consult Practical Foundations for Programming Languages for a full account.)  The crucial idea is to use inference rules to give a structural definition of evaluation for programs themselves, not a translation into some underlying abstract machine code.  For example, one rule for evaluation of addition expressions is

$\displaystyle{M\mapsto M'\over M+N\mapsto M'+N}$,

which states that one step of evaluation of the sum $M+N$ consists of a step of evaluation of $M$ to obtain $M'$, with the result being $M'+N$.  Another rule states that

$\displaystyle{N\mapsto N'\over m+N\mapsto m+N'}$,

where $m$ is a number, and a third states that $\displaystyle{m+n\mapsto p}$, where $p$ is the sum of $m$ and $n$.  With these (and similar) rules in place, one may give an execution trace in the form

$M_1\mapsto M_2 \mapsto \cdots \mapsto M_k$,

which makes explicit the steps of evaluation.  Each step is justified by a “proof tree” consisting of a composition of rules.  Importantly, the size of the expressions $M_i$ correlates directly with the space required by an implementation, allowing for a clean and clear explanation of space issues such as tail recursion.

This is a good start for explaining the semantics of a recursive function, but the real beauty of type theory emerges when teaching how to reason about a recursively defined function.  The lesson of type theory is that induction and recursion are the same thing: an inductive proof is a recursive function, and a recursive function is an inductive proof.  This means that there is really only one concept here, and not two, and that the two go hand-in-hand.  It’s not a matter of “eat your spinach, it’s good for you”, but rather of a conceptual unification of proving and programming.  For our students thinking inductively and programming recursively go together naturally, the dynamic duo of functional programming.

The starting point is, once again, an inductive definition.  For specificity and ease of exposition, let me give an inductive definition of the natural numbers:

$\displaystyle{\strut\over\texttt{Zero}\,\textsf{nat}}$

$\displaystyle{m\,\textsf{nat}\over \texttt{Succ}(m)\,\textsf{nat}}$

The definition uses the exact same rule format that we used to define the semantics.  Inductive definitions given by a set of rules become second nature.  Moreover, they transcribe almost literally into ML:

datatype nat = Zero | Succ of nat

So here we have the primordial example of an inductively defined type.  Many data structures, such as various forms of trees, fit naturally into the same format, a fact that we use repeatedly throughout the semester.

Defining a function on an inductive type consists of giving a series of clauses, one for each rule defining the elements of that type:

fun plus x y = case x of  Zero => y | Succ x’ => Succ (plus x’ y)

Whenever we have an inductive type, we use the same format each time for functions defined over it.  Crucially, we allow ourselves “recursive calls” corresponding exactly to the premises of the rule governing a particular clause.  There is no need for any fuss about “pushing” things, nor any discussion of “sizes” of things, just a pattern of programming that is guided directly by the definition of the type on which we are working.

And now comes the best part.  If we want to prove something about a recursively defined function, we write a proof that has exactly the same form, one case for each rule in the inductive definition.  And, just as we get to make a “recursive call” for each premise of each rule, so we also get to make a “recursive call” to the proof for each premise, that is, an appeal to the inductive hypothesis.  The proof is, of course, a program that constructs a proof for every element of an inductive type by composing the inductive steps around a base case.

Importantly, the students “get it”.  On the midterm we ask them to prove (in an in-class setting) a small theorem about a piece of code, or to write a piece of code and prove a small theorem about it, and they are able to do this quite reliably.  In other words, it works!

Perhaps the most important thing, though, is to convey the sense of inevitability and naturality about induction and recursion.  Don’t talk about proof as a separate topic!  Don’t operationalize recursion beyond the bounds of the language model.  Simply do the natural thing, and it will all seem natural and obvious … and fun!

Moreover, it scales.  Our high-water mark prior to the midterm is to develop and debug a continuation-based regular expression matcher that has a subtle bug in it.  We state carefully the specification of the matcher, and use an inductive proof attempt to reveal the error.  The proof does not go through!  And the failure makes clear that there is a bug that can be fixed in a number of ways, repairing the proof along with the code.  I will detail this another time, but the impatient reader may wish to consult my JFP paper entitled “Proof-directed debugging” for an earlier version of what we taught this semester.

Let me close with two remarks.

Matthias Felleisen at Northeastern has been advocating essentially this methodology for decades; his work has strongly influenced mine.  His results are nothing short of spectacular, and with a much broader range of students than I ordinarily face here at Carnegie Mellon.  So, no excuses: read his How to Design Programs, and follow it!

Finally, this article explains why Haskell is not suitable for my purposes: the principle of induction is never valid for a Haskell program!  The problem is that Haskell is a lazy language, rather than a language with laziness.  It forces on you the existence of “undefined” values of every type, invalidating the bedrock principle of proof by induction.  Put in other terms suggested to me by John Launchbury, Haskell has a paucity of types, lacking even the bog standard type of natural numbers.  Imagine!

(And please note the analogy with dynamic languages, as opposed to languages with dynamic types. It’s the same point, albeit in another guise, namely the paucity of types.)

Dynamic Languages are Static Languages

March 19, 2011

While reviewing some of the comments on my post about parallelism and concurrency, I noticed that the great fallacy about dynamic and static languages continues to hold people in its thrall.  So, in the same “everything you know is wrong” spirit, let me try to set this straight: a dynamic language is a straightjacketed static language that affords less rather than more expressiveness.   If you’re one of the lucky ones who already understands this, congratulations, you probably went to Carnegie Mellon!  For those who don’t, or think that I’m wrong, well let’s have at it.  I’m not going to very technical in this post; the full technical details are available in my forthcoming book, Practical Foundations for Programming Languages, which is available in draft form on the web.

So-called dynamic languages (“so-called” because I’m going to argue that they don’t exist as a separate class of languages) are perennially popular.  From what I can tell, it’s largely a matter of marketing.  Hey, we all want to be dynamic, don’t we?  A dynamic personality is one that makes things happen, takes risks, has fun, breaks the rules!  And who wants to be static?  It’s synonymous with boring, after all, and no one wants to be a bore (even if they are, or especially if they are).  So, hey, dynamic languages are cool, right?  Or, at any rate, cool people like dynamic languages.  Static languages are for drips and losers.

That’s one argument, and, to be honest, it’s quite a successful bit of marketing.  Like most marketing tactics, it’s designed to confuse, rather than enlighten, and to create an impression, rather than inform.

Another part of the appeal of dynamic languages appears to be that they have acquired an aura of subversion.  Dynamic languages fight against the tyranny of static languages, hooray for us!  We’re the heroic blackguards defending freedom from the tyranny of typing!  We’re real programmers, we don’t need no stinking type system!

Now I’m as susceptible to the seductions of subversion as the next guy (witness this very post), so I can appreciate wanting to strike one for the little guy, overturning the evil establishment.  But I’ve got news for you: when it comes to “dynamic typing”, it’s a futile enterprise.  Not because the establishment is so powerful.  Not because evil always wins.  But because the distinction makes no sense.  Dynamic typing is but a special case of static typing, one that limits, rather than liberates, one that shuts down opportunities, rather than opening up new vistas.  Need I say it?  Something can hardly be opposed to that of which it is but a trivial special case.  So, give it up, and get with the program!  There are ill-defined languages, and there are well-defined languages.  Well-defined languages are statically typed, and languages with rich static type systems subsume dynamic languages as a corner case of narrow, but significant, interest.

Wittgenstein said that all philosophical problems are problems of grammar.  And so it is here as well.  The root of the problem lies in the confusion between a type and a class.  We all recognize that it is often very useful to have multiple classes of values of the same type.  The prototypical example is provided by the complex numbers.  There is one type of complex numbers that represent points in two-dimensional space.  In school we encounter two classes of complex numbers, the rectangular and the polar.  That is, there are two ways to present a complex number using two different systems of coordinates.  They are, of course, interchangeable, because they represent values of the same type.  But the form of presentation differs, and some forms are more convenient than others (rectangular for addition, polar for multiplication, for example).  We could, of course, choose a “coin of the realm”, and represent all complex numbers in one way, and consider coordinates as just a matter of how a number is given.  But it can also be convenient to consider that the type of complex numbers consists of two classes, each of which consists of two real numbers, but interpreted differently according to the coordinate system we are using.

Crucially, the distinction between the two classes of complex number is dynamic in that a given computation may result in a number of either class, according to convenience or convention.  A program may test whether a complex number is in polar or rectangular form, and we can form data structures such as sets of complex numbers in which individual elements can be of either form.  But this does not conflict with the basic fact that there is but one static type of complex numbers!  In type theoretic terms what I am saying is that the type complex is defined to be the sum of two copies of the product of the reals with itself.  One copy represents the class of rectangular representations, the other represents the class of polar representations.  Being a sum type, we can “dispatch” (that is, case analyze) on the class of the value of the type complex, and decide what to do at run-time.  This is no big deal.  In fact, it’s quite simple and natural in languages such as ML or Haskell that support sum types.  (Languages that don’t are hobbled, and this is part of the confusion.)

What does this have to do with the concept of a dynamic language?  The characteristics of a dynamic language are the same, values are classified by into a variety of forms that can be distinguished at run-time.  A collection of values can be of a variety of classes, and we can sort out at run-time which is which and how to handle each form of value.  Since every value in a dynamic language is classified in this manner, what we are doing is agglomerating all of the values of the language into a single, gigantic (perhaps even extensible) type.  To borrow an apt description from Dana Scott, the so-called untyped (that is “dynamically typed”) languages are, in fact, unityped.  Rather than have a variety of types from which to choose, there is but one!

And this is precisely what is wrong with dynamically typed languages: rather than affording the freedom to ignore types, they instead impose the bondage of restricting attention to a single type!  Every single value has to be a value of that type, you have no choice!  Even if in a particular situation we are absolutely certain that a particular value is, say, an integer, we have no choice but to regard it as a value of the “one true type” that is classified, not typed, as an integer.  Conceptually, this is just rubbish, but it has serious, tangible penalties.  For one, you are depriving yourself of the ability to state and enforce the invariant that the value at a particular program point must be an integer.  For another, you are imposing a serious bit of run-time overhead to represent the class itself (a tag of some sort) and to check and remove and apply the class tag on the value each time it is used.  (See PFPL for full details of what is involved.)

Now I am fully aware that “the compiler can optimize this away”, at least in some cases, but to achieve this requires one of two things (apart from unreachable levels of ingenuity that can easily, and more profitably, be expressed by the programmer in the first place).  Either you give up on modular development, and rely on whole program analysis (including all libraries, shared code, the works), or you introduce a static type system precisely for the purpose of recording inter-modular dependencies.  The whole-program approach does not scale, and flies in the face of the very advantage that dynamic languages supposedly have, namely dynamic evolution and development of components.  The static type system approach works beautifully, and makes my point nicely.

I am also fully aware that many statically typed languages, particularly the ones widely in commercial use, do not have a sufficiently expressive type system to make it feasible to work dynamically (that is, with multiple classes of values) within a statically typed framework.  This is an argument against bad languages, not an argument against type systems.  The goal of (static, there is no other kind) type systems is precisely to provide the expressive power to capture all computational phenonema, including dynamic dispatch and dynamically extensible classification, within the rich framework of a static type discipline.  More “researchy” languages such as ML or Haskell, have no trouble handling dynamic modes of programming.

Why should you care?  Because, I argue, that a fully expressive language is one that supports the delicate interplay between static and dynamic techniques.  Languages that force only one perspective on you, namely the dynamic languages, hobble you; languages that admit both modes of reasoning enable you and liberate you from the tyranny of a single type.  Let a thousand flowers bloom!

Parallelism Is Not Concurrency

March 17, 2011

In an earlier post I mentioned that one goal of the new introductory curriculum at Carnegie Mellon is to teach parallelism as the general case of computing, rather than an esoteric, specialized subject for advanced students.  Many people are incredulous when I tell them this, because it immediately conjures in their mind the myriad complexities of concurrency: locks, monitors, deadlock, dining philosophers, ….  And if you look at the recent research literature, you will see countless papers of the form (a) parallelism is important, so (b) let’s talk about concurrency.  The situation has gotten so bad that I think that most computer scientists cannot distinguish the two concepts.  (Or maybe they willfully confuse the two so as to justify funding for work on concurrency by appealing to the importance of parallelism.)

Given this state of affairs, I cannot explain what we are doing at Carnegie Mellon (to anyone who doesn’t already know) without first rebooting you.  Trust me, I’m just going to hit Ctrl-Alt-Delete, and then click Restart.  OK, we’re back.

The first thing to understand is parallelism has nothing to do with concurrency.  Concurrency is concerned with nondeterministic composition of programs (or their components).  Parallelism is concerned with asymptotic efficiency of programs with deterministic behavior.  Concurrency is all about managing the unmanageable: events arrive for reasons beyond our control, and we must respond to them.  A user clicks a mouse, the window manager must respond, even though the display is demanding attention.  Such situations are inherently nondeterministic, but we also employ pro forma nondeterminism in a deterministic setting by pretending that components signal events in an arbitrary order, and that we must respond to them as they arise.  Nondeterministic composition is a powerful program structuring idea.  Parallelism, on the other hand, is all about dependencies among the subcomputations of a deterministic computation.  The result is not in doubt, but there are many means of achieving it, some more efficient than others.  We wish to exploit those opportunities to our advantage.

Now I can hear you object, but isn’t concurrency required to implement parallelism?  Well, yes, it is, but concurrency is also required to implement sequentiality too!  The timing signal on your processor chip is essentially a synchronization mechanism with which to coordinate the otherwise independent activity of the components of the processor.  (Yes, I know that there are asynchronous processors, but this only makes my point stronger, you still need to control the concurrency.)  The point is that concurrency is not relevant to parallelism, even if the engineers who build our parallel computing platforms must deal with concurrency.  Another way to say the same thing is that parallelism is a useful abstraction, and abstractions should never be confused with their implementations.  (Numbers are not certain sets, but they can be implemented in terms of sets.  And so on.)

All well and good, parallelism is not concurrency, but what is it?  What is an appropriate model for parallel computation?  As in the sequential case, there are two answers, a machine-based approach and a language-based approach.  I am going to criticize the machine-based formulation, and argue for the language-based formulation on two grounds: (a) it is inherently more elegant and natural, and (b) it can be taught directly to students without any fuss or bother.  Much of what I have to say is derived from the seminal and beautiful work of Guy Blelloch, which I have absorbed over years of collaboration.  To the extent I’ve gotten it right, give Guy the credit; to the extent that I’ve gotten it wrong, assign me the blame.

The canonical machine-based model is the PRAM, or parallel RAM, which generalizes the classical RAM model to allow p>0 processors that share a common memory.  There are many variations on the PRAM, according to what sorts of interactions are allowed and how conflicts are resolved among the processors.  This is not important for present purposes, so I will not attempt to categorize the variations here.  The key idea is that the number, p, of processors is fixed in advance—results are stated in terms of how efficiently an algorithm executes on a p-processor PRAM as a function of the size, n, of the input.  This is well and good as a cost model for parallel execution, but it totally sucks as a programming model!  To write code for a PRAM, you fix the number of processors, and write the code for that number of processors (often the various cases are “analogous”, but in reality you must fix p and write code for that fixed p).  Moreover, you must (officially, anyway) write low-level imperative RAM code, which introduces spurious dependencies among sub-computations by explicitly scheduling the steps of the computation.  This is horrendous!  Of course typically one writes pidgin C, and sketches how it is compiled, but the fact is you’re still writing sequential, imperative code to implement a parallel algorithm.

As I’ve argued previously, what is needed is a language-based model of computation in which we assign costs to the steps of the program we actually write, not the one it (allegedly) compiles into.  Moreover, in the parallel setting we wish to think in terms of dependencies among computations, rather than the exact order in which they are to be executed.  This allows us to factor out the properties of the target platform, such as the number, p, of processing units available, and instead write the program in an intrinsically parallel manner, and let the compiler and run-time system (that is, the semantics of the language) sort out how to schedule it onto a parallel fabric.  Here’s a useful analogy.  Just as abstract languages allow us to think in terms of data structures such as trees or lists as forms of value and not bother about how to “schedule” the data structure into a sequence of words in memory, so a parallel language should allow us to think in terms of the dependencies among the phases of a large computation and not bother about how to schedule the workload onto processors.  Storage management is to abstract values as scheduling is to deterministic parallelism.

The paradigmatic example is provided by Quicksort, a naturally parallel algorithm.  Many people think that the “big idea” of Quicksort is the in-place swapping technique that Hoare invented in the original formulation.  I argue that this is a detail; the “big idea” is that it is parallelizable.  To see this, we must work at a reasonable level of abstraction: a sort function permutes a non-empty sequence of (comparable) elements to obtain another sequence in sorted order.

fun qs xs = if |xs|=1 then xs else let val (xsl, xsg) = split xs in concat (qs xsl, qs xsg)

That is, we split the sequence into two parts (by some means) such that everything in the left part is less than everything in the right part, then sort the parts independently (in parallel!), and merge the results together.  The exact complexity of Quicksort depends on the representation of sequences, and the complexity of the operations on them.  For now, let us just gloss over this to focus on the overall structure of the computation.  If we can ensure that the sequence is split roughly in half (by any of a number of means, including randomization), then the recursion is of logarithmic depth.  At each stage we do linear work, but the recursive calls are independent of one another.  The algorithm has a natural fork-join structure in which we (a) split the sequence into two parts, (b) in parallel sort the two parts, and (c) join the results together.  (The splitting itself can be done in parallel as well, but we’ll not bother about that here.)

Analysis of Quicksort splits into two parts, the work and the depth (or span).  The work is the overall number of steps required to perform the algorithm.  On a sequence of length n, we perform O(n log n) steps to sort the sequence, as might have been expected.  This is the sequential complexity, the amount of time it would take to run on a single processor.  To account for parallelism, there is a second measure, called the depth, that measures the critical path length of the computation, the length of the longest chain of dependencies among subcomputation.  No matter how much parallelism we may have available, we can never run faster than the depth; it represents the idealized parallel complexity of the algorithm, the amount of time it would take given any number of processors.  Depending on the choice of data structures, the depth of Quicksort is O(log^2 n), or thereabouts, which makes it quite parallelizable.  (Note that, as in our earlier discussion, a step of computation is a transition step in the operational semantics of the language we are using to express the algorithm.)

This is very nice and elegant, but now comes the really beautiful part, the concept of a provable implementation of the language.  (The word “provable” is only needed here because of the shabby state of the field; having a proof that your implementation works, and how well it works, remains an oddity to be emphasized, rather than an expectation to be fulfilled by any self-respecting computer scientist.)  The idea is that the language-based model can be implemented on a PRAM (with stated interconnect properties that I will suppress here) with a provable bound on the cost that takes account of the overhead of scheduling onto a p-processor fabric.  Thus, p need not be specified in advance, but is rather determined by the run-time system and compiler, which are provided by the proof of the theorem.

Theorem A computation with work w and depth d can be implemented in a p-processor PRAM in time O(max(w/p, d)).

(Sometimes the second factor is d log p, depending on the assumptions we are making about the target PRAM.)

This is called Brent’s Principle, and it states a fairly obvious fact.  First, we can never execute a depth d program in fewer than d steps.  The depth is the critical path length; this is the amount of time we must wait to execute the algorithm, regardless of any available parallelism.  Second, to the extent that parallelism arises from the dependency (rather, indepedency) structure of the computation, we can at best do p units of work at time, keeping all processors warm, until we are done.  And that’s it!

The effectiveness of the language-based model of parallelism lies entirely in its ability to expose the dependency structure of the computation by not introducing any dependencies that are not forced on us by the nature of the computation itself.  And the key to this is functional programming, which manifests itself here in the transformational approach to computation: sorting is conceived of as a mathematical function that transforms a given sequence into another sequence.  It does not destroy the given sequence any more than adding two numbers destroys those numbers!  Since Quicksort is a mathematical function, we need not worry that execution of qs xsl interferes with (depends on) qs xsg; we can readily run them in parallel without fear of untoward consequences.  The payoff is that there are many fewer dependencies among the subcomputations, and hence many more opportunities for parallelism that can be exploited, in accord with Brent’s Principle, when scheduling the work onto a parallel fabric.

The upshot of all this is that functional programming is of paramount importance for parallelism.  That’s (one reason) why we are teaching functional programming to freshmen at Carnegie Mellon.  And it’s the main reason why we are able to.  There’s nothing to it, just program like you do math, and the parallelism will take care of itself!

Update: Simon Marlowe made similar points to mine some time ago.

Languages and Machines

March 16, 2011

Models of computation may be divided into two categories, the λ-calculus, and all the others.  Well, maybe Post’s production systems and Herbrand-Goedel equations belong in the former category, but certainly all the well-known, and widely accepted, models such as Turing machines or RAM’s, sit quite apart from the λ-calculus.  Guy Blelloch has characterized the division as between languages and machines.  The machine-based models are all based on the idea of a finite control augmented by unbounded memory.  It’s up to the programmer to manage the memory to implement anything more structured than a bit, and there is no notion, other than via an interpreter, for changing the program during execution; the machine models are all decidedly non-von Neumann because they lack the idea of a stored program.  The sole language-based model, by contrast, does not separate program from data, and offers a foundation on which realistic languages may be built, directly and without encoding.  This would seem like a decisive advantage for the λ-calculus: whereas the machine models live in the early chapters of our textbooks, they are of essentially no practical importance, the λ-calculus is directly useful for programming and for mechanization of logical systems.  And yet it is the machine models that dominate, especially in the world of algorithms and complexity, and the λ-calculus remains an esoteric oddity studied only by a relative minority of Computer Scientists.  How can that be?

When confronted with this question, my typical colleague (at least on the theory side) dismisses the question as totally uninteresting because “all models of computation are polynomial-time equivalent”, so who cares?  Well, if your work is all done modulo a polynomial factor in complexity, and you only care about computations over natural numbers (or other finitary objects), then I guess it doesn’t matter.  You tell some story once and for all of how the objects of interest are represented on some model, and leave it at that; the details just don’t matter.

But obviously the models do matter, as evidenced by the inescapable fact that the very same colleague would never under any circumstance consider anything other than a classical machine model as the underpinning for his work!  Surely, if they are all equivalent, then it can’t matter, so we can all switch to the λ-calculus, right?  Well, true though that may be, I can assure you that your argument is going nowhere.  Apparently some models are more equivalent than others after all!  Why would that be?

One reason is that a lot of work is not up to a polynomial factor (one might even suspect, most of it isn’t.)  So for those purposes the model may matter.  On the other hand, no one actually uses the “official” models in their work.  No one writes Turing machine code to describe an algorithm, nor even RAM code (Knuth notwithstanding).  Rather, they write what is charmingly called pidgin Algol, or pidgin C, or similar imperative programming notation.  That is, they use a programming language, not a machine!  As well they should.  But then why the emphasis on machine models?  And what does that pidgin they’re writing mean?

The answer is that the language is defined by a translation (that is, a compiler) that specifies that such-and-such pidgin Algol stands for such-and-such RAM code or Turing machine code.  The description is informal, to be sure, but precise enough that you get the idea.  Crucially, you reason not about the code you wrote, but the code the compiler writes, because officially the target code is the “real” code, the pidgin being a convenience.  For this to work, you have to hold the compiler in your head: it must be clear what is meant by everything you actually write by imagining its translation onto, say, a RAM.  This is not too bad, provided that the pidgin is simple, so simple that you can compile it in your head.  This works ok, but is increasingly problematic, because it limits the range of algorithms we can conveniently express and analyze by restricting the language in which we express them.

So why the insistence on such an awkward story?  One reason is, evidently, tradition.  If it was good enough for my advisor, it’s good enough for me.  Or, if you want to get published, you’d better write in a style that is familiar to the people making the decisions.

A more substantial reason, though, is that machine models are the basis for establishing the cost of a computation.  A “step” of computation is defined to be a machine step, and we compare algorithms by estimating the number of steps that they take to solve the same problem (up to a multiplicative factor, in most cases).  As long as we can “hand compile” from pidgin Algol into machine code, we can assign a cost to programs written in a higher-than-assembly level language, and we can make rational comparisons between algorithms.  This methodology has served us well, but it is beginning to break down (parallelism is one reason, but there are many others; I’ll leave discussion of this for another occasion).

There is an alternative, which is to provide a cost semantics for the language we write, and conduct our analyses on this basis directly, without appeal to a compiler or reference to an underlying machine.  In short we adopt a linguistic model of computation, rather than a machine model, and life gets better!  There is a wider range of options for expressing algorithms, and we simplify the story of how algorithms are to be analyzed.

To do this requires that we give a cost semantics for the language we’re actually using.  And this is where the λ-calculus comes into play, for it is the paradigmatic example of how to specify a language-based model of computation.  Briefly, the crucial notion in the λ-calculus is the concept of a reduction, written M→M’, expressing one step of computation of program M resulting in program M’.  Execution is defined directly on the programs we write, and the model provides a well-defined notion of computation step that we can count to obtain the time complexity of the program.  Decades of experience shows that this approach scales to realistic languages.

What’s not to like?  Personally, I have no idea.  I have been mystified by the suspicion with which the λ-calculus seems to be regarded in algorithmic circles.  Just about any machine model is regarded as a perfectly good foundation, yet the behavior of the field itself shows that machine models are of limited utility for the work that is actually done!

The only reason I can think of, apart from powerful social effects, is that there is lingering, yet totally unfounded, doubt about whether the concept of cost that arises from the λ-calculus is suitable for algorithm analysis.  (One researcher recently told me “you can hide an elephant inside a β-reduction!”, yet he could not explain to me why I’m wrong.)  One way to justify this claim is to define, once and for all, a compiler from λ-calculus to RAM code that makes clear that the abstract steps of the λ-calculus can be implemented in constant time on a RAM (in the sense of not varying with the size of the input, only in the size of the static program).  Blelloch and Greiner have carried out exactly this analysis in their seminal work in the early 90’s; see Guy’s web page for details.

In a later post I will take these ideas a bit further, and show (by explaining Blelloch’s work) that the λ-calculus serves not only as a good model for sequential algorithms, but also for parallel algorithms!  And that we can readily extend the model to languages with important abstractions such as higher-order functions or exceptions without compromising their utility for expressing and analyzing algorithms.

What is a Functional Language?

March 16, 2011

I mentioned in a previous post that I am developing a new course on functional programming for first-year students.  The immediate question, which none of you asked, is “what do you mean by functional programming?”.  Good question!  Now, if only I had a crisp answer, I’d be in good shape.

Socially speaking, the first problem I run into is a fundamental misunderstanding about the term “functional programming”.  Most people think that it stands for an ideological position that is defined in opposition to what everyone else thinks and does.  So in the mind of a typical colleague, when I speak of functional programming, she hears “deviant religious doctrine” and “knows” instinctively to disregard it.  Yet what I mean by functional programming subsumes what everyone else is doing as a narrow special case.  For as readers of this blog surely know, there is no better imperative language than a functional language!  In fact I often say that Haskell is the world’s best imperative programming language; I’m only half (if that much) joking.

So if functional programming subsumes imperative programming, then what the hell am I talking about when I advocate for functional programming at the introductory level?  And why, then, do we also have an imperative programming course?  Very good questions.  I’m not sure I have a definitive answer, but this being a blog, I can air my opinions.

Were it up to me, we would have one introductory course, probably two semesters long, on programming, which would, of course, emphasize functional as well as imperative techniques, all in one language.  Personally, I think this would be the right way to go, and would prepare the students for all future courses in our curriculum, and for work out there in the “real world.”  Not everyone believes me (and sometimes I am not sure I believe myself), so we must compromise.  While it surely sounds all nice and ecumenical to teach both imperative and functional programming, the reality is that we’re teaching old and new programming methods.  It’s not so much the imperative part that counts, but their obsolescence.  It is deemed important (and, to be honest, I can see the merit in the argument) that students learn the “old ways” because that’s how most of the world works.  So we have to use a language that is ill-defined (admits programs that cannot be given any meaning in terms of the language itself), that forces sequential, rather than parallel thinking, that demands manual allocation of memory, and that introduces all sorts of complications, such as null pointers, that need not exist at all.  And then we have to teach about things called “tools” that help manage the mess we’ve created—tools that do things like recover from Boolean blindness or check for null pointer errors.  To be sure, these tools are totally amazing, but then so are hot metal typesetting machines and steam locomotives.

So what, then, is a functional language?  I can do no better than list some criteria that are important for what I do; the languages that satisfy these criteria are commonly called functional languages, so that’s what I’ll call them too.

1. It should have a well-defined semantics expressed in terms of the programs you write, not in terms of how it is “compiled” onto some hypothetical hardware and software platform.  I want to be able to reason about the code I’m writing, not about the code the compiler writes.  And I want to define the meaning of my program in terms of the constructs I’m using, not in terms of some supposed underlying mechanism that implements it.  This is the essence of abstraction.
2. It should support both computation by evaluation and computation by execution.  Evaluation is a smooth generalization of high school- and university-level mathematics, and is defined in terms of standard mathematical concepts such as tuples, functions, numbers, and so forth.  Variables are given meaning by substitution, and evaluation consists of simplifying expressions.  Execution is the action of a program on another agent or data structure conceived of as existing independently of the program itself.  The data lives “over there” and the program “over here” diddles that data.  Assignables (my word for what in imperative languges are wrongly called variables) are given meaning by get and put operations (fetch and store), not by substitution.  Execution consists of performing these operations.
3. It should be natural to consider both persistent and ephemeral data structures.  This is essentially a restatement of (2).  Persistent data structures are values that can be manipulated like any other value, regardless of their apparent “size” or “complexity”.  We can as easily pass functions as arguments and return them as results as we can do the same with integers.  We can think of trees or streams as values in the same manner.  Ephemeral data structures are what you learn about in textbooks.  They are external to the program, and manipulated destructively by execution.  These are useful, but not nearly as central as the current texts would have you believe (for lack of consideration of any other kind).
4. It should have a natural parallel cost model based on the  dependencies among computations so that one may talk about parallel algorithms and their complexity as naturally as one currently discusses sequential algorithms.  Imperative-only programming languages fail this criterion miserably, because there are too many implicit dependencies among the components of a computation, with the result that sequentiality is forced on you by the artificial constraints of imperative thinking.
5. It should have a rich type structure that permits introduction of new abstract types and which supports modular program development.  By giving short shrift to expressions and evaluation, imperative language have an impoverished notion of type—and not support for modularity.  Worse, object-oriented programming, a species of imperative programming, is fundamentally antimodular because of the absurd emphasis on inheritance and the reliance on class-based organizations in which functionality metastasizes throughout a program.

What languages satisfy these criteria best?  Principally ML (both Standard ML and Objective Caml, the “objective” part notwithstanding), and Haskell. (Unfortunately, Haskell loses on the cost model, about which more in another post.)

This is why we teach ML.

Boolean Blindness

March 15, 2011

I hate Booleans!  But isn’t everything “just bits”?  Aren’t computers built from gates?  And aren’t gates just the logical connectives?  How can a Computer Scientist possibly hate Booleans?

Fritz Henglein recently pointed out to me that the world of theoretical Computer Science is divided into two camps, which I’ll call the logicians (who are concerned with languages and semantics) and the combinatorialists (who are concerned with algorithms and complexity).  The logicians love to prove that programs work properly, the combinatorialists love to count how many steps a program takes to run.  Yet, curiously, the logicians hate the Booleans, and love abstractions such as trees or streams, whereas the combinatorialists love bits, and hate abstraction!  (Or so it often seems; allow me my rhetorical devices.)

My point is this.  Booleans are just one, singularly (or, perhaps, binarily) boring, type of data.  A Boolean, b, is either true, or false; that’s it.  There is no information carried by a Boolean beyond its value, and that’s the rub.  As Conor McBride puts it, to make use of a Boolean you have to know its provenance so that you can know what it means.

Booleans are almost invariably (in the CS literature and textbooks) confused with propositions, which express an assertion, or make a claim.  For a proposition, p, to be true means that it has a proof; there is a communicable, rational argument for why p is the case.  For a proposition, p, to be false means that it has a refutation; there is a counterexample that shows why p is not the case.

The language here is delicate.  The judgement that a proposition, p, is true is not the same as saying that p is equal to true, nor is the judgement that p is false the same as saying that p is equal to false!  In fact, it makes no sense to even ask the question, is p equal to true, for the former is a proposition (assertion), whereas the latter is a Boolean (data value); they are of different type.

In classical mathematics, where computability is not a concern, it is possible to conflate the notions of Booleans and propositions, so that there are only two propositions in the world, true and false.  The role of a proof is to show that a given proposition is (equal to) one of these two cases.  This is well and good, provided that you’re not interested in computing whether something is true or false.  As soon as you are, you must own up to a distinction between a general proposition, whose truth cannot be decided, and a Boolean, which can be computed one way or the other.  But then, as was mentioned earlier, we must link it up with a proposition expressing what the Boolean means, which is to specify its provenance.  And as soon as we do that, we realize that Booleans have no special status, and are usually a poor choice of data structure anyway because they lead to a condition that Dan Licata calls Boolean blindness.

A good example is the equality test, e=e’, of type bool.  As previously discussed, this seemingly innocent function presents serious complications, particularly in abstract languages with a rich type structure.  It also illustrates the standard confusion between a proposition and a Boolean.  As the type suggests, the equality test operation is a function that returns either true or false, according to whether its arguments are equal or not (in a sense determined by their type).  Although the notation invites confusion, the expression e=e’ is not a proposition stating that e and e’ are equal!  It is, rather, a piece of data of one of two forms, either true or false.  The equality test is a function that happens to have the property that if it returns true, then its arguments are equal (an associated equality proposition is true), and if it returns false, then its arguments are not equal (an associated equality proposition is false), and, moreover, is ensured to return either true or false for any inputs.  This may seem like a hair-splitting distinction, but it is not!  For consider, first of all, that the equality test may have been written with a different syntax, maybe e==e’, or (equal? e e’), or any number of other notations.  This makes clear, I hope, that the connection between the function and the associated proposition must be explicitly made by some other means; the function and the proposition are different things.  More interestingly, the same functionality can be expressed different ways.  For example, for partially ordered types we may write e<=e’ andalso e'<=e, or similar, satisfying the same specification as the innocent little “equals sign” function.  The connection is looser still here, and still looser in numerous other variations that you can spin for yourself.

What harm is there in making this confusion?  Perhaps the greatest harm is that it confuses a fundamental distinction, and this has all sorts of bad consequences.  For example, a student may reasonably ask, why is equality not defined for functions?  After all, in the proof I did the other day, I gave a carefully argument by structural induction that two functions are equal.  And then I turn around and claim that equality is not defined for functions!  What on earth am I talking about?  Well, if you draw the appropriate distinction, there is no issue.  Of course there is a well-defined concept of two functions being mathematically equal; it, in general, requires proof.  But there is no well-defined computable function that returns true iff two functions (on integers, say) are equal, and false otherwise.  Thus, = : (int->int) x (int->int) -> prop expresses equality, but = : (int->int)->(int->int)->bool cannot satisfy the specification just given.  Or, to put it another way, you cannot test equality of propositions!  (And rightly so.)

Another harm is the condition of Boolean blindness alluded to earlier.  Suppose that I evaluate the expression e=e’ to test whether e and e’ are equal.  I have in my hand a bit.  The bit itself has no intrinsic meaning; I must associate a provenance with that bit in order to give it meaning.  “This bit being true means that e and e’ are equal, whereas this other bit being false means that some other two expressions are not equal.”  Keeping track of this information (or attempting to recover it using any number of program analysis techniques) is notoriously difficult.  The only thing you can do with a bit is to branch on it, and pretty soon you’re lost in a thicket of if-the-else’s, and you lose track of what’s what.  Evolve the program a little, and you’re soon out to sea, and find yourself in need of sat solvers to figure out what the hell is going on.

But this is yet another example of an iatrogenic disorder!  The problem is computing the bit in the first place.  Having done so, you have blinded yourself by reducing the information you have at hand to a bit, and then trying to recover that information later by remembering the provenance of that bit.  An illustrative example was considered in my article on equality:

fun plus x y = if x=Z then y else S(plus (pred x) y)

Here we’ve crushed the information we have about x down to one bit, then branched on it, then were forced to recover the information we lost to justify the call to pred, which typically cannot recover the fact that its argument is non-zero and must check for it to be sure.  What a mess!  Instead, we should write

fun plus x y = case x of Z => y | S(x’) => S(plus x’ y).

No Boolean necessary, and the code is improved as well!  In particular, we obtain the predecessor en passant, and have no need to keep track of the provenance of any bit.

[Update: removed last paragraph.]