## Old Neglected Theorems Are Still Theorems

March 20, 2014

I have very recently been thinking about the question of partiality vs totality in programming languages, a perennial topic in PL’s that every generation thinks it discovers for itself.  And this got me to remembering an old theorem that, it seems, hardly anyone knows ever existed in the first place.  What I like about the theorem is that it says something specific and technically accurate about the sizes of programs in total languages compared to those in partial languages.  The theorem provides some context for discussion that does not just amount to opinion or attitude (and attitude alway seems to abound when this topic arises).

The advantage of a total programming language such as Goedel’s T is that it ensures, by type checking, that every program terminates, and that every function is total. There is simply no way to have a well-typed program that goes into an infinite loop. This may seem appealing, until one considers that the upper bound on the time to termination can be quite large, so large that some terminating programs might just as well diverge as far as we humans are concerned. But never mind that, let us grant that it is a virtue of  T that it precludes divergence.

Why, then, bother with a language such as PCF that does not rule out divergence? After all, infinite loops are invariably bugs, so why not rule them out by type checking? (Don’t be fooled by glib arguments about useful programs, such as operating systems, that “run forever”. After all, infinite streams are programmable in the language M of inductive and coinductive types in which all functions terminate. Computing infinitely does not mean running forever, it just means “for as long as one wishes, without bound.”)  The notion does seem appealing until one actually tries to write a program in a language such as T.

Consider computing the greatest common divisor (GCD) of two natural numbers. This can be easily programmed in PCF by solving the following equations using general recursion:

$\begin{array}{rcl} \textit{gcd}(m,0) & = & m \\ \textit{gcd}(0,m) & = & m \\ \textit{gcd}(m,n) & = & \textit{gcd}(m-n,n) \quad \text{if}\ m>n \\ \textit{gcd}(m,n) & = & \textit{gcd}(m,n-m) \quad \text{if}\ m

The type of $\textit{gcd}$ defined in this manner has partial function type $(\mathbb{N}\times \mathbb{N})\rightharpoonup \mathbb{N}$, which suggests that it may not terminate for some inputs. But we may prove by induction on the sum of the pair of arguments that it is, in fact, a total function.

Now consider programming this function in T. It is, in fact, programmable using only primitive recursion, but the code to do it is rather painful (try it!). One way to see the problem is that in T the only form of looping is one that reduces a natural number by one on each recursive call; it is not (directly) possible to make a recursive call on a smaller number other than the immediate predecessor. In fact one may code up more general patterns of terminating recursion using only primitive recursion as a primitive, but if you examine the details, you will see that doing so comes at a significant price in performance and program complexity. Program complexity can be mitigated by building libraries that codify standard patterns of reasoning whose cost of development should be amortized over all programs, not just one in particular. But there is still the problem of performance. Indeed, the encoding of more general forms of recursion into primitive recursion means that, deep within the encoding, there must be “timer” that “goes down by ones” to ensure that the program terminates. The result will be that programs written with such libraries will not be nearly as fast as they ought to be.  (It is actually quite fun to derive “course of values” recursion from primitive recursion, and then to observe with horror what is actually going on, computationally, when using this derived notion.)

But, one may argue, T is simply not a serious language. A more serious total programming language would admit sophisticated patterns of control without performance penalty. Indeed, one could easily envision representing the natural numbers in binary, rather than unary, and allowing recursive calls to be made by halving to achieve logarithmic complexity. This is surely possible, as are numerous other such techniques. Could we not then have a practical language that rules out divergence?

We can, but at a cost.  One limitation of total programming languages is that they are not universal: you cannot write an interpreter for T within T (see Chapter 9 of PFPL for a proof).  More importantly, this limitation extends to any total language whatever.  If this limitation does not seem important, then consider the Blum Size Theorem (BST) (from 1967), which places a very different limitation on total languages.  Fix any total language, L, that permits writing functions on the natural numbers. Pick any blowup factor, say $2^{2^n}$, or however expansive you wish to be.  The BST states that there is a total function on the natural numbers that is programmable in L, but whose shortest program in L is larger by the given blowup factor than its shortest program in PCF!

The underlying idea of the proof is that in a total language the proof of termination of a program must be baked into the code itself, whereas in a partial language the termination proof is an external verification condition left to the programmer. Roughly speaking, there are, and always will be, programs whose termination proof is rather complicated to express, if you fix in advance the means by which it may be proved total. (In T it was primitive recursion, but one can be more ambitious, yet still get caught by the BST.)  But if you leave room for ingenuity, then programs can be short, precisely because they do not have to embed the proof of their termination in their own running code.

There are ways around the BST, of course, and I am not saying otherwise.  For example, the BST merely guarantees the existence of a bad case, so one can always argue that such a case will never arise in practice.  Could be, but I did mention the GCD in T problem for a reason: there are natural problems that are difficult to express in a language such as T.  By fixing the possible termination arguments in advance, one is tempting fate, for there are many problems, such as the Collatz Conjecture, for which the termination proof of a very simple piece of code has been an open problem for decades, and has resisted at least some serious attempts on it.  One could argue that such a function is of no practical use.  I agree, but I point out the example not to say that it is useful, but to say that it is likely that its eventual termination proof will be quite nasty, and that this will have to be reflected in the program itself if you are limited to a T-like language (rendering it, once again, useless).  For another example, there is no inherent reason why termination need be assured by means similar to that used in T.  We got around this issue in NuPRL by separating the code from the proof, using a type theory based on a partial programming language, not a total one.  The proof of termination is still required for typing in the core theory (but not in the theory with “bar types” for embracing partiality).  But it’s not baked into the code itself, affecting its run-time; it is “off to the side”, large though it may be).

Updates: word smithing, fixed bad link, corrected gcd, removed erroneous parenthetical reference to Coq, fixed LaTeX problems.

## Intro Curriculum Update

August 17, 2012

In previous posts I have talked about the new introductory CS curriculum under development at Carnegie Mellon. After a year or so of planning, we began to roll out the new curriculum in the Spring of 2011, and have by now completed the transition. As mentioned previously, the main purpose is to bring the introductory sequence up to date, with particular emphasis on introducing parallelism and verification. A secondary purpose was to restore the focus on computing fundamentals, and correct the drift towards complex application frameworks that offer the students little sense of what is really going on. (The poster child was a star student who admitted that, although she had built a web crawler the previous semester, she in fact has no idea how to build a web crawler.) A particular problem is that what should have been a grounding in the fundamentals of algorithms and data structures turned into an exercise in object-oriented programming, swamping the core content with piles of methodology of dubious value to beginning students. (There is a new, separate, upper-division course on oo methodology for students interested in this topic.) A third purpose was to level the playing field, so that students who had learned about programming on the street were equally as challenged, if not more so, than students without much or any such experience. One consequence would be to reduce the concomitant bias against women entering CS, many fewer of whom having prior computing experience than the men.

The solution was a complete do-over, jettisoning the traditional course completely, and starting from scratch. The most important decision was to emphasize functional programming right from the start, and to build on this foundation for teaching data structures and algorithms. Not only does FP provide a much more natural starting point for teaching programming, it is infinitely more amenable to rigorous verification, and provides a natural model for parallel computation. Every student comes to university knowing some algebra, and they are therefore familiar with the idea of computing by calculation (after all, the word algebra derives from the Arabic al jabr, meaning system of calculation). Functional programming is a generalization of algebra, with a richer variety of data structures and a richer set of primitives, so we can build on that foundation. It is critically important that variables in FP are, in fact, mathematical variables, and not some distorted facsimile thereof, so all of their mathematical intuitions are directly applicable. So we can immediately begin discussing verification as a natural part of programming, using principles such as mathematical induction and equational reasoning to guide their thinking. Moreover, there are natural concepts of sequential time complexity, given by the number of steps required to calculate an answer, and parallel time complexity, given by the data dependencies in a computation (often made manifest by the occurrences of variables). These central concepts are introduced in the first week, and amplified throughout the semester.

Two major homework exercises embody the high points of the first-semester course, one to do with co-development of code with proof, the other to do with parallelism. Co-development of program and proof is illustrated by an online regular expression matcher. The problem is a gem for several reasons. One is that it is essentially impossible for anyone to solve by debugging a blank screen. This sets us up nicely for explaining the importance of specification and proof as part of the development process. Another is that it is very easy, almost inevitable, for students to make mistakes that are not easily caught or diagnosed by testing. We require the students to carry out a proof of the correctness of the matcher, and in the process discover a point where the proof breaks down, which then leads to a proper solution. (See my JFP paper “Proof-Directed Debugging” for a detailed development of the main ideas.) The matcher also provides a very nice lesson in the use of higher-order functions to capture patterns of control, resulting in an extremely clean and simple matcher whose correctness proof is highly non-trivial.

The main parallelism example is the Barnes-Hut algorithm for solving the n-body problem in physics. Barnes-Hut is an example of a “tree-based” method, invented by Andrew Appel, for solving this well-known problem. At a high level the main idea is to decompose space into octants (or quadrants if you’re working in the plane), recursively solving the problem for each octant and then combining the solutions to make an overall solution. The key idea is to use an approximation for bodies that are far enough away—a distant constellation can be regarded as an atomic body for the purposes of calculating the effects of its stars on the sun, say. The problem is naturally parallelizable, because of the recursive decomposition. Moreover, it provides a very nice link to their high school physics. Since FP is just an extension of mathematics, the equations specifying the force law and Newton’s Law carry over directly to the code. This is an important sense in which FP builds on and amplifies their prior mathematical experience, and shows how one may connect computer science with other sciences in a rather direct manner.

The introductory FP course establishes the basis for the new data structures and algorithms course that most students take in either the third or fourth semester. This course represents a radical departure from tradition in several ways. First, it is a highly rigorous course in algorithms that rivals the upper-division course taught at most universities (including our own) for the depth and breadth of ideas it develops. Second, it takes the stance that all algorithms are parallel algorithms, with sequential being but a special case of parallel. Of course some algorithms have a better parallelizability ratio (a precise technical characterization of the ability to make use of parallelism), and this can be greatly affected by data structure selection, among other considerations. Third, the emphasis is on persistent abstract types, which are indispensable for parallel computing. No more linked lists, no more null pointer nonsense, no more mutation. Instead we consider abstract types of graphs, trees, heaps, and so forth, all with a persistent semantics (operations create “new” ones, rather than modify “old” ones). Fourth, we build on the reasoning methods introduced in the first semester course to prove the correctness and the efficiency of algorithms. Functional programming makes all of this possible. Programs are concise and amenable to proof, they are naturally parallel, and they enjoy powerful typing properties that enforce abstraction in a mathematically rigorous manner. Fifth, there is a strong emphasis on problems of practical interest. For example, homework 1 is the shotgun method for genome sequencing, a parallel algorithm of considerable practical importance and renown.

There is a third introductory course in imperative programming, taken in either the first or second semester (alternating with the functional programming course at the student’s discretion), that teaches the “old ways” of doing things using a “safe” subset of C. Personally, I think this style of programming is obsolete, but there are many good reasons to continue to teach it, the biggest probably being the legacy problem. The emphasis is on verification, using simple assertions that are checked at run-time and which may be verified statically in some situations. It is here that students learn how to do things “the hard way” using low-level representations. This course is very much in the style of the 1970’s era data structures course, the main difference being that the current incarnation of Pascal has curly braces, rather than begin-end.

For the sake of those readers who may not be up to date on such topics, it seems important to emphasize that functional programming subsumes imperative programming. Every functional language is capable of expressing the old-fashioned sequential pointer-hacking implementation of data structures. You can even reproduce Tony Hoare’s self-described “billion dollar mistake” of the cursed “null pointer” if you’d like! But the whole point is that it is rarely useful, and almost never elegant, to work that way. (Curiously, the “monad mania” in the Haskell community stresses an imperative, sequential style of programming that is incompatible with parallelism, but this is not forced on you as it is in the imperative world.) From this point of view there no loss, and considerable gain, in teaching functional programming from the start. It puts a proper emphasis on mathematically sane programming methods, but allows for mutation-based programming where appropriate (for example, in engendering “side effects” on users).

I encourage readers to review the syllabi and course materials. There is quite a large body of material in place that we plan to expand into textbook-level treatments in the near future. We also plan to write a journal article documenting our experiences with these courses.

I am very grateful to my colleagues Guy Blelloch, Dan Licata, and Frank Pfenning for their efforts in helping to develop the new introductory curriculum at Carnegie Mellon, and to the teaching assistants whose hard work and dedication put the ideas into practice.

Update: Unfortunately, the homework assignments for these courses are not publicly available, because we want to minimize the temptation for students to make use of old assignments and solutions in preparing their own work.  I am working with my colleagues to find some way in which we can promote the ideas without sacrificing too much of the integrity of the course materials.  I apologize for the inconvenience.

## Some Thoughts on Teaching FP

April 17, 2011

A number of people have asked some very good questions about the details of how we teach certain concepts in our new functional programming class for freshmen at Carnegie Mellon.  Rather than spray responses among the various comments, I’ll summarize a few major points here in hopes of helping others who may wish to teach a similar course.  So this post is not really meant for a broad audience, but rather for the specialist; feel free to skip it if it seems too focused for your interests.  I promise to write some more controversial material of more general interest soon!  Meanwhile, here are a few thoughts presented in no particular order of importance.

Moving to a more technical level, the use of structural operational semantics is indispensable for providing a rigorous foundation for understanding program execution, reasoning about program correctness, and for defining a cost model to support reasoning about asymptotic complexity.  There is no substitute for this!  Without a crisp formulation of the semantics of the language, it is impossible to discuss any of these issues in a meaningful and technically precise way.  With it you can readily resolve any questions about “what happens if …”, giving the students a tool that they can use themselves to answer such questions.  Moreover, as program verification becomes more important in industrial practice, as well as academic research, it is essential that students be educated in the tools of semantics.  Structural operational semantics is very easy to teach, and presents no problems for the students.  We just use it, and they get it without any fuss or bother.  It is a natural extension of their experience with high school algebra.  Be not afraid of using these tools to teach programming!

As I’ve explained previously, it is a very good idea to avoid Booleans as much as possible.  And, above all, don’t mention equality!  The equals sign in programming languages is not the equals sign of mathematics.  Propositions are not Booleans, and it only confuses matters to use notations that encourage this misconception.  Related to this, avoid if-then-else entirely, and instead use only case analysis for branching, even when the value to be discriminated is a Boolean.  We consistently write things like

case Int.compare(x,y) of
LESS => ...
| GREATER => ...
| EQUAL => ...

rather than a nested conditional branch.  It encourages students to think in terms of pattern matching, and prepares the ground for later developments, including a smooth transition to pattern matching over more complex data structures and reasoning inductively when programming recursively.

Teaching parallelism is completely straightforward, because the model of computation inherently avoids unnatural and unnecessary problems of interference, and focuses attention on the critical issue of data dependencies among computations in a program.  Students have no trouble computing the work (sequential time complexity) or span (parallel time complexity) of a program, and have no problems reading off recurrences for the respective time complexities.  Later, when we introduce sequences, the idea of computing in parallel with the entire sequence, rather than item-by-iterm (as encouraged by the dreadful iterators so beloved in the oo world), comes naturally and easily.  The key to this, of course, is that data structures in a functional language are naturally persistent; it is monstrously hard to use ephemeral data structures in a parallel computation, and is not something we could hope to teach freshmen.

A major decision for us is how to teach the expression and enforcement of abstraction in a program.  In a departure from our previous approach, we have decided against using opaque ascription (sealing) as a means of enforcing abstraction.  It has its virtues, but the problem is that it does not mesh well with other language features, in particular with substructures and type classes (views).  For example, consider the signature of a mapping whose domain is an ordered type of keys:

signature MAPPING = sig
structure Key : ORDERED
type 'a mapping
val lookup : Key.t -> 'a mapping -> 'a
...
end

Unfortunately, sealing a structure with this signature renders the module useless:

structure IntMapping :> MAPPING = struct
structure Key = Int
type 'a mapping = 'a bst
...
end

The trouble is that not only is the type ‘a IntMapping.mapping abstract, as intended, but so is IntMapping.Key.t, which is not at all intended!  To get around this we we must create a specialization of the signature MAPPING using one of several means such as

signature INT_MAPPING = MAPPING where type Key.t=int
structure IntMapping :> INT_MAPPING = ...

Of course one need not name the signature, but this illustrates the general problem.  As things get more complicated, you have more and more clauses that specify the types of things (sharing specifications).

The alternative, which has worked very well for us, is to eschew opaque ascription, and instead rely on the datatype mechanism to make types abstract.  So to give an implementation of the abstract type of mappings with keys being integers, we proceed as follows:

structure IntMapping : MAPPING = struct
structure Key : ORDERED = Int
datatype 'a bst = Empty | Node of 'a bst * (Key.t * 'a) * 'a bst
type 'a mapping = 'a bst
val insert = ...
end

The point is that since the constructors of the type ‘a bst are not exported in the interface, the type ‘a IntMapping.mapping is abstract.  Note as well that the use of transparent ascription on the structure Key ensures that keys really are integers (of type Int.int), and are not abstract, exactly as intended.  This formulation allows us to state simple rules of signature matching (every specification in the signature has a corresponding declaration in the structure), and allows us to enforce abstraction boundaries with a minimum of fuss.  The students have had absolutely no trouble with this at all, and we have had no trouble structuring our code this way.

When using functors (parameterized modules) to combine modules it is, of course, necessary to impose sharing constraints to ensure that only coherent compositions are possible.  (Rather than take the space to explain this here, I will rely on the reader’s experience to understand what is involved here.)  These sorts of sharing specifications are perfectly natural, easily explained, and have presented absolutely no difficulties for the students.  We illustrated their use in our game tree search example, in which the “referee” module is parameterized by the two “player” modules, which must of course cohere on their concept of a “game” (it’s no use pitting a chess player against a checkers player!).  The code looks like this

functor Referee
(structure Player1 : PLAYER and Player2 : PLAYER
sharing type Player1.Game.t = Player2.Game.t) : REFEREE = ...

The sharing specification states precisely and concisely the natural coherence constraint governing the two players.  Here again, the dog we feared never barked, and the students found it all quite intuitive and unproblematic.  This allowed them to expend their time on the actual complexities of the problem at hand, such as how to think about alpha-beta pruning in a parallel game-tree search, rather than get all tied up with the bureaucracy of structuring the code itself.

The virtue of teaching bright young students is that they are bright and they are young.  Their brilliance is, of course, a pleasure.  We have to work hard to come up with sufficiently challenging exercises, and many students challenge us with their solutions to our problems.  Their youth means that they come to us with a minimum of misconceptions and misinformation that they’ve picked up on the street, and are open to learning methods that are entirely non-standard (at least for now) with respect to what their friends are learning at other universities.  What makes Carnegie Mellon a special place is precisely that the students are pushed into thinking hard in ways that they might not be.  Personally, I hope that more universities worldwide build on what we have started, and give their students the same benefits that ours are already enjoying.

Follow