## Practical Foundations for Programming Languages, Second Edition

April 11, 2016

Today I received my copies of Practical Foundations for Programming Languages, Second Edition on Cambridge University Press.  The new edition represents a substantial revision and expansion of the first edition, including these:

1. A new chapter on type refinements has been added, complementing previous chapters on dynamic typing and on sub-typing.
2. Two old chapters were removed (general pattern matching, polarization), and several chapters were very substantially rewritten (higher kinds, inductive and co-inductive types, concurrent and distributed Algol).
3. The parallel abstract machine was revised to correct an implied extension that would have been impossible to carry out.
4. Numerous corrections and improvements were made throughout, including memorable and pronounceable names for languages.
5. Exercises were added to the end of each chapter (but the last).  Solutions are available separately.
6. The index was revised and expanded, and some conventions systematized.
7. An inexcusably missing easter egg was inserted.

I am grateful to many people for their careful reading of the text and their suggestions for correction and improvement.

In writing this book I have attempted to organize a large body of material on programming language concepts, all presented in the unifying framework of type systems and structural operational semantics.  My goal is to give precise definitions that provide a clear basis for discussion and a foundation for both analysis and implementation.  The field needs such a foundation, and I hope to have helped provide one.

## It Is What It Is (And Nothing Else)

February 22, 2016

A recent discussion of introductory computer science education led to the topic of teaching recursion.  I was surprised to learn that students are being taught that recursion requires understanding something called a “stack” that is nowhere in evidence in their code.  Few, if any, students master the concept, which is usually “covered” only briefly.  Worst, they are encouraged to believe that recursion is a mysterious bit of esoterica that is best ignored.

And thus is lost one of the most important and beautiful concepts in computing.

The discussion then moved on to the implementation of recursion in certain inexplicably popular languages for teaching programming.  As it turns out, the compilers mis-implement recursion, causing unwarranted space usage in common cases.  Recursion is dismissed as problematic and unimportant, and the compiler error is elevated to a “design principle” — to be snake-like is to do it wrong.

And thus is lost one of the most important and beautiful concepts in computing.

And yet, for all the stack-based resistance to the concept, recursion has nothing to do with a stack.  Teaching recursion does not need any mumbo-jumbo about “stacks”.  Implementing recursion does not require a “stack”.  The idea that the two concepts are related is simply mistaken.

What, then, is recursion?  It is nothing more than self-reference, the ability to name a computation for use within the computation itself.  Recursion is what it is, and nothing more.  No stacks, no tail calls, no proper or improper forms, no optimizations, just self-reference pure and simple.  Recursion is not tied to “procedures” or “functions” or “methods”; one can have self-referential values of all types.

Somehow these very simple facts, which date back to the early 1930’s, have been replaced by damaging myths that impede teaching and using recursion in programs.  It is both a conceptual and a practical loss.  For example, the most effective methods for expressing parallelism in programs rely heavily on recursive self-reference; much would be lost without it.  And the allegation that “real programmers don’t use recursion” is beyond absurd: the very concept of a digital computer is grounded in recursive self-reference (the cross-connection of gates to form a latch).  (Which, needless to say, does not involve a stack.)  Not only do real programmers use recursion, there could not even be programmers were it not for recursion.

I have no explanation for why this terrible misconception persists.  But I do know that when it comes to programming languages, attitude trumps reality every time.  Facts?  We don’t need no stinking facts around here, amigo.  You must be some kind of mathematician.

If all the textbooks are wrong, what is right?  How should one explain recursion?  It’s simple.  If you want to refer to yourself, you need to give yourself a name.  “I” will do, but so will any other name, by the miracle of α-conversion.  A computation is given a name using a fixed point (not fixpoint, dammit) operator:  fix x is e stands for the expression e named x for use within e.  Using it, the textbook example of the factorial function is written thus:

fix f is fun n : nat in case n {zero => 1 | succ(n') => n * f n'}.

Let us call this whole expression fact, for convenience.  If we wish to evaluate it, perhaps because we wish to apply it to an argument, its value is

fun n : nat in case n {zero => 1 | succ(n') => n * fact n'}.

The recursion has been unrolled one step ahead of execution.  If we reach fact again, as we will for a positive argument,  fact is evaluated again, in the same way, and the computation continues.  There are no stacks involved in this explanation.

Nor is there a stack involved in the implementation of fixed points.  It is only necessary to make sure that the named computation does indeed name itself.  This can be achieved by a number of means, including circular data structures (non-well-founded abstract syntax), but the most elegant method is by self-application.  Simply arrange that a self-referential computation has an implicit argument with which it refers to itself.  Any use of the computation unrolls the self-reference, ensuring that the invariant is maintained.  No storage allocation is required.

Consequently, a self-referential functions such as

fix f is fun (n : nat, m:nat) in case n {zero => m | succ(n') => f (n',n*m)}

execute without needing any asymptotically significant space.  It is quite literally a loop, and no special arrangement is required to make sure that this is the case.  All that is required is to implement recursion properly (as self-reference), and you’re done.  There is no such thing as tail-call optimization.  It’s not a matter of optimization, but of proper implementation.  Calling it an optimization suggests it is optional, or unnecessary, or provided only as a favor, when it is more accurately described as a matter of getting it right.

So what, then, is the source of the confusion?  The problem seems to be a too-close association between compound expressions and recursive functions or procedures.  Consider the classic definition of factorial given earlier.  The body of the definition involves the expression

n * fact n'

where there is a pending multiplication to be accounted for.  Once the recursive call (to itself) completes, the multiplication can be carried out, and it is necessary to keep track of this pending obligation.  But this phenomenon has nothing whatsoever to do with recursion.  If you write

n * square n'

then it is equally necessary to record where the external call is to return its value.  In typical accounts of recursion, the two issues get confused, a regrettable tragedy of error.

Really, the need for a stack arises the moment one introduces compound expressions.  This can be explained in several ways, none of which need pictures or diagrams or any discussion about frames or pointers or any extra-linguistic concepts whatsoever.  The best way, in my opinion, is to use Plotkin’s structural operational semantics, as described in my Practical Foundations for Programming Languages (Second Edition) on Cambridge University Press.

There is no reason, nor any possibility, to avoid recursion in programming.  But folk wisdom would have it otherwise.  That’s just the trouble with folk wisdom, everyone knows it’s true, even when it’s not.

Update: Dan Piponi and Andreas Rossberg called attention to a pertinent point regarding stacks and recursion.  The conventional notion of a run-time stack records two distinct things, the control state of the program (such as subroutine return addresses, or, more abstractly, pending computations, or continuations), and the data state of the program (a term I just made up because I don’t know a better one, for managing multiple simultaneous activations of a given procedure or function).  Fortran (back in the day) didn’t permit multiple activations, meaning that at most one instance of a procedure can be in play at a given time.  One consequence is that α-equivalence can be neglected: the arguments of a procedure can be placed in a statically determined spot for the call.  As a member of the Algol-60 design committee Dijkstra argued, successfully, for admitting multiple procedure activations (and hence, with a little extra arrangement, recursive/self-referential procedures).  Doing so requires that α-equivalence be implemented properly; two activations of the same procedure cannot share the same argument locations.  The data stack implements α-equivalence using de Bruijn indices (stack slots); arguments are passed on the data stack using activation records in the now-classic manner invented by Dijkstra for the purpose.  It is not self-reference that gives rise to the need for a stack, but rather re-entrancy of procedures, which can arise in several ways, not just recursion.  Moreover, recursion does not always require re-entrancy—the so-called tail call optimization is just the observation that certain recursive procedures are not, in fact, re-entrant.  (Every looping construct illustrates this principle, albeit on an ad hoc basis, rather than as a general principle.)

## The Power of Negative Thinking

January 27, 2015

Exception tracking is a well-known tar baby of type system design.  After all, if expressions can have two sorts of result, why shouldn’t the type say something about them both?  Languages such as CLU, FX, and Java, to name three, provide “throws” or “raises” clauses to the types of procedures that specify an upper bound on the exceptions that can occur when they are called.  It all seems natural and easy, but somehow the idea never really works very well.  One culprit is any form of higher-order programming, which is inherent in object-oriented and functional languages alike.  To handle the indirection requires more complex concepts, such as effect polymorphism, to make thing work reasonably well.  Or untracked exceptions are used to avoid the need to track them.  Somehow such an appealing idea seems rather stickier to realize than one might expect.  But why?

A piece of the puzzle was put into place by Xavier Leroy and François Pessaux in their paper on tracking uncaught exceptions. Their idea was to move use type-based methods to track uncaught exceptions, but to move the clever typing techniques required out of the programming language itself and into a separate analysis tool.  They make effective use of the powerful concept of row polymorphism introduced by Didier Rémy for typing records and variants in various dialects of Caml.  Moving exception tracking out of the language and into a verification tool is the decisive move, because it liberates the analyzer from any constraints that may be essential at the language level.

But why track uncaught exceptions?  That is, why track uncaught exceptions, rather than caught exceptions?  From a purely methodological viewpoint it seems more important to know that a certain code fragment cannot raise certain exceptions (such as the $\texttt{match}$ exception in ML, which arises when a value matches no pattern in a case analysis).  In a closed world in which all of the possible exceptions are known, then tracking positive information about which exceptions might be raised amounts to the same as tracking which exceptions cannot be raised, by simply subtracting the raised set from the entire set.  As long as the raised set is an upper bound on the exceptions that might be raised, then the difference is a lower bound on the set of exceptions that cannot be raised.  Such conservative approximations are necessary because a non-trivial behavioral property of a program is always undecidable, and hence requires proof.  In practice this means that stronger invariants must be maintained than just the exception information so that one may prove, for example, that the values passed to a pattern match are limited to those that actually do satisfy some clause of an inexhaustive match.

How realistic is the closed world assumption?  For it to hold seems to require a whole-program analysis, and is therefore non-modular, a risky premise in today’s world.  Even on a whole-program basis exceptions must be static in the sense that, even if they are scoped, they may in principle be declared globally, after suitable renaming to avoid collisions.  The global declarations collectively determine the whole “world” from which positive exception tracking information may be subtracted to obtain negative exception information.  But in languages that admit multiple instantiations of modules, such as ML functors, static exceptions are not sufficient (each instance should introduce a distinct exception).  Instead, static exceptions must be replaced by dynamic exceptions that are allocated at initialization time, or even run-time, to ensure that no collisions can occur among the instances.  At that point we have an open world of exceptions, one in which there are exceptions that may be raised, but which cannot be named in any form of type that seeks to provide an upper bound on the possible uncaught exceptions that may arise.

For example consider the ML expression

let
exception X
in
raise X
end

If one were to use positive exception tracking, what would one say about the expression as a whole?  It can, in fact it does, raise the exception $\texttt{X}$, yet this fact is unspeakable outside of the scope of the declaration.   If a tracker does not account for this fact, it is unsound in the sense that the uncaught exceptions no longer provide an upper bound on what may be raised.  One maneuver, used in Java, for example, is to admit a class of untracked exceptions about which no static information is maintained.  This is useful, because it allows one to track those exceptions that can be tracked (by the Java type system) and to not track those that cannot.

In an open world (which includes Java, because exceptions are a form of object) positive exception tracking becomes infeasible because there is no way to name the exceptions that might be tracked.  In the above example the exception $\textsf{X}$ is actually a bound variable bound to a reference to an exception constructor.  The name of the bound variable ought not matter, so it is not even clear what the exception raised should be called.  (It is amusing to see the messages generated by various ML compilers when reporting uncaught exceptions.  The information they provide is helpful, certainly, but is usually, strictly speaking, meaningless, involving identifiers that are not in scope.)

The methodological considerations mentioned earlier suggest a way around this difficulty.  Rather than attempt to track those exceptions that might be raised, instead track the exceptions that cannot be raised.  In the above example there is nothing to say about $\texttt{X}$ not being raised, because it is being raised, so we’re off the hook there.  The “dual” example

let
exception X
in
2+2
end

illustrates the power of negative thinking.  The body of the $\textsf{let}$ does not raise the exception bound to $\textsf{X}$, and this may be recorded in a type that makes sense within the scope of $\textsf{X}$.  The crucial point is that when exiting its scope it is sound to drop mention of this information in a type for the entire expression.  Information is lost, but the analysis is sound.  In contrast there is no way to drop positive information without losing soundness, as the first example shows.

One way to think about the situation is in terms of type refinements, which express properties of the behavior of expressions of a type.  To see this most clearly it is useful to separate the exception mechanism into two parts, the control part and the data part.  The control aspect is essentially just a formulation of error-passing style, in which every expression has either a normal return of a specified type, or an exceptional return of the type associated to all exceptions.  (Nick Benton and Andrew Kennedy nicely formulated this view of exceptions as an extension of the concept of a monad.)

The data aspect is, for dynamic exceptions, the type of dynamically classified values, which is written $\textsf{clsfd}$ in PFPL.  Think of it as an open-ended sum in which one can dynamically generate new classifiers (aka summands, injections, constructors, exceptions, channels, …) that carry a value of a specified type.  According to this view the exception $\textsf{X}$ is bound to a dynamically-generated classifier carrying a value of unit type.  (Classifier allocation is a storage effect, so that the data aspect necessarily involves effects, whereas the control aspect may, and, for reasons of parallelism, be taken as pure.)  Exception constructors are used to make values of type $\textsf{clsfd}$, which are passed to handlers that can deconstruct those values by pattern matching.

Type refinements come into play as a means of tracking the class of a classified value.  For the purposes of exception tracking, the crucial refinements of the type $\textsf{clsfd}$ are the positive refinement, $\textsf{a}\cdot$, and the negative refinement$\overline{\textsf{a}\cdot}$, which specify that a classified value is, or is not, of class $\textsf{a}$.  Positive exception tracking reduces to maintaining invariants expressed by a disjunction of positive refinements; negative exception tracking reduces to maintaining invariants expressed by a conjunction of negative refinements.  Revisiting the logic of exception tracking, the key is that the entailment

$\overline{\mathsf{a}_1\cdot}\wedge \cdots\wedge \overline{\mathsf{a}_n\cdot} \leq \overline{\mathsf{a}_1\cdot} \wedge \cdots \wedge \overline{\mathsf{a}_{n-1}\cdot}$

is valid, whereas the “entailment”

$\mathsf{a}_1\cdot \vee \cdots \vee \mathsf{a}_n\cdot \leq \mathsf{a}_1\cdot\vee \cdots \vee \mathsf{a}_{n-1}\cdot$

is not.  Thus, in the negative setting we may get ourselves out of the scope of an exception by weakening the refinement, an illustration of the power of negative thinking.

## Summer of Programming Languages

July 6, 2014

Having just returned from the annual Oregon Programming Languages Summer School, at which I teach every year, I am once again very impressed with the impressive growth in the technical sophistication of the field and with its ability to attract brilliant young students whose enthusiasm and idealism are inspiring.  Eugene was, as ever, an ideal setting for the summer school, providing a gorgeous setting for work and relaxation.  I was particularly glad for the numerous chances to talk with students outside of the classroom, usually over beer, and I enjoyed, as usual, the superb cycling conditions in Eugene and the surrounding countryside.  Many students commented to me that the atmosphere at the summer school is wonderful, filled with people who are passionate about programming languages research, and suffused with a spirit of cooperation and sharing of ideas.

Started by Zena Ariola a dozen years ago, this year’s instance was organized by Greg Morrisett and Amal Ahmed in consultation with Zena.  As usual, the success of the school depended critically on the dedication of Jim Allen, who has been the de facto chief operating officer since it’s inception.  Without Jim, OPLSS could not exist.  His attention to detail, and his engagement with the students are legendary.   Support from the National Science Foundation CISE Division, ACM SIGPLANMicrosoft Research, Jane Street Capital, and BAE Systems was essential for providing an excellent venue,  for supporting a roster of first-rate lecturers, and for supporting the participation of students who might otherwise not have been able to attend.  And, of course, an outstanding roster of lecturers donated their time to come to Eugene for a week to share their ideas with the students and their fellow lecturers.

The schedule of lectures is posted on the web site, all of which were taped, and are made available on the web.  In addition many speakers provided course notes, software, and other backing materials that are also available online.  So even if you were not able to attend, you can still benefit from the summer school, and perhaps feel more motivated to come next summer.  Greg and I will be organizing, in consultation with Zena.  Applying the principle “don’t fix what isn’t broken”, we do not anticipate major changes, but there is always room for improvement and the need to freshen up the content every year.  For me the central idea of the summer school is the applicability of deep theory to everyday practice.  Long a dream held by researchers such as me, these connections become more “real” every year as the theoretical abstractions of yesterday become the concrete practices of today.  It’s breathtaking to see how far we’ve come from the days when I was a student just beginning to grasp the opportunities afforded by ideas from proof theory, type theory, and category theory (the Holy Trinity) to building beautiful software systems.  No longer the abstruse fantasies of mad (computer) scientists, these ideas are the very air we breathe in PL research.  Gone are the days of ad hoc language designs done in innocence of the foundations on which they rest.  Nowadays serious industrial-strength languages are emerging that are grounded in theory and informed by practice.

Two examples have arisen just this summer, Rust (from Mozila) and Swift (from Apple), that exemplify the trend.  Although I have not had time to study them carefully, much less write serious code using them, it is evident from even a brief review of their web sites that these are serious languages that take account of the academic developments of the last couple of decades in formulating new language designs to address new classes of problems that have arisen in programming practice.  These languages are type safe, a basic criterion of sensibility, and feature sophisticated type systems that include ideas such as sum types, which have long been missing from commercial languages, or provided only in comically obtuse ways (such as objects).  The infamous null pointer mistakes have been eradicated, and the importance of pattern matching (in the sense of the ML family of languages) is finally being appreciated as the cure for Boolean blindness.  For once I can look at new industrial languages without an overwhelming sense of disappointment, but instead with optimism and enthusiasm that important ideas are finally, at long last, being recognized and adopted.  As has often been observed, it takes 25 years for an academic language idea to make it into industrial practice.  With Java it was simply the 1970’s idea of automatic storage management; with languages such as Rust and Swift we are seeing ideas from the 80’s and 90’s make their way into industrial practice.  It’s cause for celebration, and encouragement for those entering the field: the right ideas do win out in the end, one just has to have the courage to be irrelevant.

I hope to find the time to comment more meaningfully on the recent developments in practical programming languages, including Rust and Swift, but also languages such as Go and OCaml that are also making inroads into programming practice.  (The overwhelming success and future dominance of Haskell is self-evident.  Kudos!) But for now, let me say that the golden age of programming language research is here and now, and promises to continue indefinitely.

Update: word smithing.

## Parallelism and Concurrency, Revisited

April 9, 2014

I still get compliments on and criticisms of my post from three years ago (can it possibly be that long?) on parallelism and concurrency.  In that post I offered a “top down” argument to the effect that these are different abstractions with different goals: parallelism is about exploiting computational resources to maximize efficiency, concurrency is about non-deterministic composition of components in a system.  Parallelism never introduces bugs (the semantics is identical to the sequential execution), but concurrency could be said to be the mother lode of all bugs (the semantics of a component changes drastically, without careful provision, when composed concurrently with other components).  From this point of view the two concepts aren’t comparable, yet relatively few people seem to accept the distinction, or, even if they do, do not accept the terminology.

Here I’m going to try a possible explanation of why the two concepts, which seem separable to me, may seem inseparable to others.

I think that it is to do with scheduling.

One view of parallelism is that it’s just talk for concurrency, because all you do when you’re programming in parallel is fork off some threads, and then do something with their results when they’re done.  I’ve previously argued that parallelism is about cost, but let’s leave that aside.  It’s unarguable that a parallel computation does consist of a bunch of, well, parallel computations, and so it is about concurrency.  I’ve previously argued that that’s not a good way to think about concurrency either, but let’s leave that aside as well.  So, the story goes, concurrency and parallelism are synonymous, and people like me are just creating confusion.

Perhaps that is true, but here’s why it may not be a good idea to think of parallelism this way.  Scheduling as you learned about it in OS class (for example) is a altogether different than scheduling for parallelism.  There are two aspects of OS-like scheduling that I think are relevant here.  First, it is non-deterministic, and second, it is competitive.  Non-deterministic, because you have little or no control over what runs when or for how long.  A beast like the Linux scheduler is controlled by a zillion “voodoo parameters” (a turn of phrase borrowed from my queueing theory colleague, Mor Harchol-Balter), and who the hell knows what is going to happen to your poor threads once they’re in its clutches.  Second, and more importantly, an OS-like scheduler is allocating resources competitively.  You’ve got your threads, I’ve got my threads, and we both want ours to get run as soon as possible.  We’ll even pay for the privilege (priorities) if necessary.  The scheduler, and the queueing theory behind it is designed to optimize resource usage on a competitive basis, taking account of quality of service guarantees purchased by the participants.  It does not matter whether there is one processor or one thousand processors, the schedule is unpredictable.  That’s what makes concurrent programming hard: you have to program against all possible schedules.  And that’s why it’s hard to prove much about the time or space complexity of your program when it’s implemented concurrently.

Parallel scheduling is a whole ‘nother ball of wax.  It is (usually, but not necessarily) deterministic, so that you can prove bounds on its efficiency (Brent-type theorems, as discussed in a previous post and in PFPL).  And, more importantly, it is cooperative in the sense that all threads are working together for the same computation towards the same ends.  The threads are scheduled so as to get the job (there’s only one) done as quickly and as efficiently as possible.  Deterministic schedulers for parallelism are the most common, because they are the easiest to analyze with respect to their time and space bounds.  Greedy schedulers, which guarantee to maximize use of available processors, never leaving any idle when there is work to be done, form an important class for which the simple form of Brent’s Theorem is obvious.

Many deterministic greedy scheduling algorithms are known, of which I will mention p-DFS and p-BFS, which do p-at-a-time depth- and breadth-first search of the dependency graph, and various forms of work-stealing schedulers, pioneered by Charles Leiserson at MIT.  (Incidentally, if you don’t already know what p-DFS or p-BFS are, I’ll warn you that they are a little trickier than they sound.  In particular p-DFS uses a data structure that is sort of like a stack but is not a stack.)  These differ significantly in their time bounds (for example, work stealing usually involves expectation over a random variable, whereas the depth- and breadth-first traversals do not), and differ dramatically in their space complexity.  For example, p-BFS is absolutely dreadful in its space complexity.  (For a full discussion of these issues in parallel scheduling, I recommend Dan Spoonhower’s PhD Dissertation.  His semantic profiling diagrams are amazingly beautiful and informative!)

So here’s the thing: when you’re programming in parallel, you don’t just throw some threads at some non-deterministic competitive scheduler.  Rather, you generate an implicit dependency graph that a cooperative scheduler uses to maximize efficiency, end-to-end.  At the high level you do an asymptotic cost analysis without considering platform parameters such as the number of processors or the nature of the interconnect.  At the low level the implementation has to validate that cost analysis by using clever techniques to ensure that, once the platform parameters are known, maximum use is made of the computational resources to get your job done for you as fast as possible.  Not only are there no bugs introduced by the mere fact of being scheduled in parallel, but even better, you can prove a theorem that tells you how fast your program is going to run on a real platform.  Now how cool is that?

[Update: word-smithing.]

[Update: more word-smithing for clarity and concision.]

## Old Neglected Theorems Are Still Theorems

March 20, 2014

I have very recently been thinking about the question of partiality vs totality in programming languages, a perennial topic in PL’s that every generation thinks it discovers for itself.  And this got me to remembering an old theorem that, it seems, hardly anyone knows ever existed in the first place.  What I like about the theorem is that it says something specific and technically accurate about the sizes of programs in total languages compared to those in partial languages.  The theorem provides some context for discussion that does not just amount to opinion or attitude (and attitude alway seems to abound when this topic arises).

The advantage of a total programming language such as Goedel’s T is that it ensures, by type checking, that every program terminates, and that every function is total. There is simply no way to have a well-typed program that goes into an infinite loop. This may seem appealing, until one considers that the upper bound on the time to termination can be quite large, so large that some terminating programs might just as well diverge as far as we humans are concerned. But never mind that, let us grant that it is a virtue of  T that it precludes divergence.

Why, then, bother with a language such as PCF that does not rule out divergence? After all, infinite loops are invariably bugs, so why not rule them out by type checking? (Don’t be fooled by glib arguments about useful programs, such as operating systems, that “run forever”. After all, infinite streams are programmable in the language M of inductive and coinductive types in which all functions terminate. Computing infinitely does not mean running forever, it just means “for as long as one wishes, without bound.”)  The notion does seem appealing until one actually tries to write a program in a language such as T.

Consider computing the greatest common divisor (GCD) of two natural numbers. This can be easily programmed in PCF by solving the following equations using general recursion:

$\begin{array}{rcl} \textit{gcd}(m,0) & = & m \\ \textit{gcd}(0,m) & = & m \\ \textit{gcd}(m,n) & = & \textit{gcd}(m-n,n) \quad \text{if}\ m>n \\ \textit{gcd}(m,n) & = & \textit{gcd}(m,n-m) \quad \text{if}\ m

The type of $\textit{gcd}$ defined in this manner has partial function type $(\mathbb{N}\times \mathbb{N})\rightharpoonup \mathbb{N}$, which suggests that it may not terminate for some inputs. But we may prove by induction on the sum of the pair of arguments that it is, in fact, a total function.

Now consider programming this function in T. It is, in fact, programmable using only primitive recursion, but the code to do it is rather painful (try it!). One way to see the problem is that in T the only form of looping is one that reduces a natural number by one on each recursive call; it is not (directly) possible to make a recursive call on a smaller number other than the immediate predecessor. In fact one may code up more general patterns of terminating recursion using only primitive recursion as a primitive, but if you examine the details, you will see that doing so comes at a significant price in performance and program complexity. Program complexity can be mitigated by building libraries that codify standard patterns of reasoning whose cost of development should be amortized over all programs, not just one in particular. But there is still the problem of performance. Indeed, the encoding of more general forms of recursion into primitive recursion means that, deep within the encoding, there must be “timer” that “goes down by ones” to ensure that the program terminates. The result will be that programs written with such libraries will not be nearly as fast as they ought to be.  (It is actually quite fun to derive “course of values” recursion from primitive recursion, and then to observe with horror what is actually going on, computationally, when using this derived notion.)

But, one may argue, T is simply not a serious language. A more serious total programming language would admit sophisticated patterns of control without performance penalty. Indeed, one could easily envision representing the natural numbers in binary, rather than unary, and allowing recursive calls to be made by halving to achieve logarithmic complexity. This is surely possible, as are numerous other such techniques. Could we not then have a practical language that rules out divergence?

We can, but at a cost.  One limitation of total programming languages is that they are not universal: you cannot write an interpreter for T within T (see Chapter 9 of PFPL for a proof).  More importantly, this limitation extends to any total language whatever.  If this limitation does not seem important, then consider the Blum Size Theorem (BST) (from 1967), which places a very different limitation on total languages.  Fix any total language, L, that permits writing functions on the natural numbers. Pick any blowup factor, say $2^{2^n}$, or however expansive you wish to be.  The BST states that there is a total function on the natural numbers that is programmable in L, but whose shortest program in L is larger by the given blowup factor than its shortest program in PCF!

The underlying idea of the proof is that in a total language the proof of termination of a program must be baked into the code itself, whereas in a partial language the termination proof is an external verification condition left to the programmer. Roughly speaking, there are, and always will be, programs whose termination proof is rather complicated to express, if you fix in advance the means by which it may be proved total. (In T it was primitive recursion, but one can be more ambitious, yet still get caught by the BST.)  But if you leave room for ingenuity, then programs can be short, precisely because they do not have to embed the proof of their termination in their own running code.

There are ways around the BST, of course, and I am not saying otherwise.  For example, the BST merely guarantees the existence of a bad case, so one can always argue that such a case will never arise in practice.  Could be, but I did mention the GCD in T problem for a reason: there are natural problems that are difficult to express in a language such as T.  By fixing the possible termination arguments in advance, one is tempting fate, for there are many problems, such as the Collatz Conjecture, for which the termination proof of a very simple piece of code has been an open problem for decades, and has resisted at least some serious attempts on it.  One could argue that such a function is of no practical use.  I agree, but I point out the example not to say that it is useful, but to say that it is likely that its eventual termination proof will be quite nasty, and that this will have to be reflected in the program itself if you are limited to a T-like language (rendering it, once again, useless).  For another example, there is no inherent reason why termination need be assured by means similar to that used in T.  We got around this issue in NuPRL by separating the code from the proof, using a type theory based on a partial programming language, not a total one.  The proof of termination is still required for typing in the core theory (but not in the theory with “bar types” for embracing partiality).  But it’s not baked into the code itself, affecting its run-time; it is “off to the side”, large though it may be).

Updates: word smithing, fixed bad link, corrected gcd, removed erroneous parenthetical reference to Coq, fixed LaTeX problems.

## Exceptions Are Shared Secrets

December 3, 2012

Exceptions are commonly criticized as being the “goto’s” of modern programming languages.  Raising an exception transfers control to an unknown destination, it is said, and this is a bad thing on engineering grounds.  I disagree.  It is perfectly predictable where a raised exception will be handled, provided that exceptions are done properly, which, unfortunately, is not always the case.

The crucial point that exception values are shared secrets. Let us distinguish two parties, the raiser of the exception, and the handler of it. The fundamental idea of exceptions is to transfer a value from the raiser to the handler without the possibility of interception by another party. While the language of secrecy seems appropriately evocative, I hasten to add that I am not here concerned with “attackers” or suchlike, but merely with the difficulties of ensuring modular composition of programs from components. In such a setting the “attacker” is yourself, who is not malicious, but who is fallible.

By raising an exception the raiser is “contacting” a handler with a message. The raiser wishes to limit which components of a program may intercept that message. More precisely, the raiser wishes to ensure that only certain previously agreed-upon components may handle that exception, perhaps only one. This property should remain stable under extension to the program or composition with any other component. It should not be possible for an innocent third party to accidentally intercept a message that was not intended for it.

Achieving this requires a secrecy mechanism that allows the raiser and the handler(s) to agree upon their cooperation. This is accomplished by dynamic classification, exactly as it is done properly in Standard ML (but not O’Caml). The idea is that the raiser has access to a dynamically generated constructor for exception values, and any handler has access to the corresponding dynamically generated matcher for exception values. This means that the handler, and only the handler, can decode the message sent by the raiser; no other party can do anything with it other than pass it along unexamined. It is “perfectly encrypted” and cannot be deciphered by any unintended component.

The usual exception mechanisms, as distinct from exception values, allow for “wild-card handlers”, which means that an exception can be intercepted by a third party. This means that the raiser cannot ensure that the handler actually receives the message, but it can ensure, using dynamic classification, that only a legitimate handler may decipher it. Decades of experience with Standard ML shows that this is a very useful thing indeed, and has application far beyond just the simple example considered here. For full details, see my forthcoming book, for a full discussion of dynamic classification and its role for ensuring integrity and confidentiality in a program. Dynamic classification is not just for “security”, but is rather a good tool for everyday programming.

Haskell is one language that does not get exceptions right.  Allocating an exception incurs a storage effect, and so would have to be confined to the IO monad in Haskell.  But this would destroy the utility of exceptions in pure code.  The result is an exception mechanism that is arguably broken and that does not provide the guarantees that make exceptions a perfectly pleasant and useful way to write code.

Update: Reworked last paragraph to clarify the point I am making; the previous formulation appears to have invited misinterpretation.

Update: This account of exceptions also makes clear why the perennial suggestion to put exception-raising information into types makes no sense to me. I will write more about this in a future post, but meanwhile contemplate that a computation may raise an exception that is not even in principle nameable in the type. That is, it is not conservativity that’s at issue, it’s the very idea.

Update: Wordsmithing, removal of irrelevant remarks to focus on the main point about dynamic exceptions.

Follow