## The Power of Negative Thinking

January 27, 2015

Exception tracking is a well-known tar baby of type system design.  After all, if expressions can have two sorts of result, why shouldn’t the type say something about them both?  Languages such as CLU, FX, and Java, to name three, provide “throws” or “raises” clauses to the types of procedures that specify an upper bound on the exceptions that can occur when they are called.  It all seems natural and easy, but somehow the idea never really works very well.  One culprit is any form of higher-order programming, which is inherent in object-oriented and functional languages alike.  To handle the indirection requires more complex concepts, such as effect polymorphism, to make thing work reasonably well.  Or untracked exceptions are used to avoid the need to track them.  Somehow such an appealing idea seems rather stickier to realize than one might expect.  But why?

A piece of the puzzle was put into place by Xavier Leroy and François Pessaux in their paper on tracking uncaught exceptions. Their idea was to move use type-based methods to track uncaught exceptions, but to move the clever typing techniques required out of the programming language itself and into a separate analysis tool.  They make effective use of the powerful concept of row polymorphism introduced by Didier Rémy for typing records and variants in various dialects of Caml.  Moving exception tracking out of the language and into a verification tool is the decisive move, because it liberates the analyzer from any constraints that may be essential at the language level.

But why track uncaught exceptions?  That is, why track uncaught exceptions, rather than caught exceptions?  From a purely methodological viewpoint it seems more important to know that a certain code fragment cannot raise certain exceptions (such as the $\texttt{match}$ exception in ML, which arises when a value matches no pattern in a case analysis).  In a closed world in which all of the possible exceptions are known, then tracking positive information about which exceptions might be raised amounts to the same as tracking which exceptions cannot be raised, by simply subtracting the raised set from the entire set.  As long as the raised set is an upper bound on the exceptions that might be raised, then the difference is a lower bound on the set of exceptions that cannot be raised.  Such conservative approximations are necessary because a non-trivial behavioral property of a program is always undecidable, and hence requires proof.  In practice this means that stronger invariants must be maintained than just the exception information so that one may prove, for example, that the values passed to a pattern match are limited to those that actually do satisfy some clause of an inexhaustive match.

How realistic is the closed world assumption?  For it to hold seems to require a whole-program analysis, and is therefore non-modular, a risky premise in today’s world.  Even on a whole-program basis exceptions must be static in the sense that, even if they are scoped, they may in principle be declared globally, after suitable renaming to avoid collisions.  The global declarations collectively determine the whole “world” from which positive exception tracking information may be subtracted to obtain negative exception information.  But in languages that admit multiple instantiations of modules, such as ML functors, static exceptions are not sufficient (each instance should introduce a distinct exception).  Instead, static exceptions must be replaced by dynamic exceptions that are allocated at initialization time, or even run-time, to ensure that no collisions can occur among the instances.  At that point we have an open world of exceptions, one in which there are exceptions that may be raised, but which cannot be named in any form of type that seeks to provide an upper bound on the possible uncaught exceptions that may arise.

For example consider the ML expression

let
exception X
in
raise X
end

If one were to use positive exception tracking, what would one say about the expression as a whole?  It can, in fact it does, raise the exception $\texttt{X}$, yet this fact is unspeakable outside of the scope of the declaration.   If a tracker does not account for this fact, it is unsound in the sense that the uncaught exceptions no longer provide an upper bound on what may be raised.  One maneuver, used in Java, for example, is to admit a class of untracked exceptions about which no static information is maintained.  This is useful, because it allows one to track those exceptions that can be tracked (by the Java type system) and to not track those that cannot.

In an open world (which includes Java, because exceptions are a form of object) positive exception tracking becomes infeasible because there is no way to name the exceptions that might be tracked.  In the above example the exception $\textsf{X}$ is actually a bound variable bound to a reference to an exception constructor.  The name of the bound variable ought not matter, so it is not even clear what the exception raised should be called.  (It is amusing to see the messages generated by various ML compilers when reporting uncaught exceptions.  The information they provide is helpful, certainly, but is usually, strictly speaking, meaningless, involving identifiers that are not in scope.)

The methodological considerations mentioned earlier suggest a way around this difficulty.  Rather than attempt to track those exceptions that might be raised, instead track the exceptions that cannot be raised.  In the above example there is nothing to say about $\texttt{X}$ not being raised, because it is being raised, so we’re off the hook there.  The “dual” example

let
exception X
in
2+2
end

illustrates the power of negative thinking.  The body of the $\textsf{let}$ does not raise the exception bound to $\textsf{X}$, and this may be recorded in a type that makes sense within the scope of $\textsf{X}$.  The crucial point is that when exiting its scope it is sound to drop mention of this information in a type for the entire expression.  Information is lost, but the analysis is sound.  In contrast there is no way to drop positive information without losing soundness, as the first example shows.

One way to think about the situation is in terms of type refinements, which express properties of the behavior of expressions of a type.  To see this most clearly it is useful to separate the exception mechanism into two parts, the control part and the data part.  The control aspect is essentially just a formulation of error-passing style, in which every expression has either a normal return of a specified type, or an exceptional return of the type associated to all exceptions.  (Nick Benton and Andrew Kennedy nicely formulated this view of exceptions as an extension of the concept of a monad.)

The data aspect is, for dynamic exceptions, the type of dynamically classified values, which is written $\textsf{clsfd}$ in PFPL.  Think of it as an open-ended sum in which one can dynamically generate new classifiers (aka summands, injections, constructors, exceptions, channels, …) that carry a value of a specified type.  According to this view the exception $\textsf{X}$ is bound to a dynamically-generated classifier carrying a value of unit type.  (Classifier allocation is a storage effect, so that the data aspect necessarily involves effects, whereas the control aspect may, and, for reasons of parallelism, be taken as pure.)  Exception constructors are used to make values of type $\textsf{clsfd}$, which are passed to handlers that can deconstruct those values by pattern matching.

Type refinements come into play as a means of tracking the class of a classified value.  For the purposes of exception tracking, the crucial refinements of the type $\textsf{clsfd}$ are the positive refinement, $\textsf{a}\cdot$, and the negative refinement$\overline{\textsf{a}\cdot}$, which specify that a classified value is, or is not, of class $\textsf{a}$.  Positive exception tracking reduces to maintaining invariants expressed by a disjunction of positive refinements; negative exception tracking reduces to maintaining invariants expressed by a conjunction of negative refinements.  Revisiting the logic of exception tracking, the key is that the entailment

$\overline{\mathsf{a}_1\cdot}\wedge \cdots\wedge \overline{\mathsf{a}_n\cdot} \leq \overline{\mathsf{a}_1\cdot} \wedge \cdots \wedge \overline{\mathsf{a}_{n-1}\cdot}$

is valid, whereas the “entailment”

$\mathsf{a}_1\cdot \vee \cdots \vee \mathsf{a}_n\cdot \leq \mathsf{a}_1\cdot\vee \cdots \vee \mathsf{a}_{n-1}\cdot$

is not.  Thus, in the negative setting we may get ourselves out of the scope of an exception by weakening the refinement, an illustration of the power of negative thinking.

## Summer of Programming Languages

July 6, 2014

Having just returned from the annual Oregon Programming Languages Summer School, at which I teach every year, I am once again very impressed with the impressive growth in the technical sophistication of the field and with its ability to attract brilliant young students whose enthusiasm and idealism are inspiring.  Eugene was, as ever, an ideal setting for the summer school, providing a gorgeous setting for work and relaxation.  I was particularly glad for the numerous chances to talk with students outside of the classroom, usually over beer, and I enjoyed, as usual, the superb cycling conditions in Eugene and the surrounding countryside.  Many students commented to me that the atmosphere at the summer school is wonderful, filled with people who are passionate about programming languages research, and suffused with a spirit of cooperation and sharing of ideas.

Started by Zena Ariola a dozen years ago, this year’s instance was organized by Greg Morrisett and Amal Ahmed in consultation with Zena.  As usual, the success of the school depended critically on the dedication of Jim Allen, who has been the de facto chief operating officer since it’s inception.  Without Jim, OPLSS could not exist.  His attention to detail, and his engagement with the students are legendary.   Support from the National Science Foundation CISE Division, ACM SIGPLANMicrosoft Research, Jane Street Capital, and BAE Systems was essential for providing an excellent venue,  for supporting a roster of first-rate lecturers, and for supporting the participation of students who might otherwise not have been able to attend.  And, of course, an outstanding roster of lecturers donated their time to come to Eugene for a week to share their ideas with the students and their fellow lecturers.

The schedule of lectures is posted on the web site, all of which were taped, and are made available on the web.  In addition many speakers provided course notes, software, and other backing materials that are also available online.  So even if you were not able to attend, you can still benefit from the summer school, and perhaps feel more motivated to come next summer.  Greg and I will be organizing, in consultation with Zena.  Applying the principle “don’t fix what isn’t broken”, we do not anticipate major changes, but there is always room for improvement and the need to freshen up the content every year.  For me the central idea of the summer school is the applicability of deep theory to everyday practice.  Long a dream held by researchers such as me, these connections become more “real” every year as the theoretical abstractions of yesterday become the concrete practices of today.  It’s breathtaking to see how far we’ve come from the days when I was a student just beginning to grasp the opportunities afforded by ideas from proof theory, type theory, and category theory (the Holy Trinity) to building beautiful software systems.  No longer the abstruse fantasies of mad (computer) scientists, these ideas are the very air we breathe in PL research.  Gone are the days of ad hoc language designs done in innocence of the foundations on which they rest.  Nowadays serious industrial-strength languages are emerging that are grounded in theory and informed by practice.

Two examples have arisen just this summer, Rust (from Mozila) and Swift (from Apple), that exemplify the trend.  Although I have not had time to study them carefully, much less write serious code using them, it is evident from even a brief review of their web sites that these are serious languages that take account of the academic developments of the last couple of decades in formulating new language designs to address new classes of problems that have arisen in programming practice.  These languages are type safe, a basic criterion of sensibility, and feature sophisticated type systems that include ideas such as sum types, which have long been missing from commercial languages, or provided only in comically obtuse ways (such as objects).  The infamous null pointer mistakes have been eradicated, and the importance of pattern matching (in the sense of the ML family of languages) is finally being appreciated as the cure for Boolean blindness.  For once I can look at new industrial languages without an overwhelming sense of disappointment, but instead with optimism and enthusiasm that important ideas are finally, at long last, being recognized and adopted.  As has often been observed, it takes 25 years for an academic language idea to make it into industrial practice.  With Java it was simply the 1970’s idea of automatic storage management; with languages such as Rust and Swift we are seeing ideas from the 80’s and 90’s make their way into industrial practice.  It’s cause for celebration, and encouragement for those entering the field: the right ideas do win out in the end, one just has to have the courage to be irrelevant.

I hope to find the time to comment more meaningfully on the recent developments in practical programming languages, including Rust and Swift, but also languages such as Go and OCaml that are also making inroads into programming practice.  (The overwhelming success and future dominance of Haskell is self-evident.  Kudos!) But for now, let me say that the golden age of programming language research is here and now, and promises to continue indefinitely.

Update: word smithing.

## Parallelism and Concurrency, Revisited

April 9, 2014

I still get compliments on and criticisms of my post from three years ago (can it possibly be that long?) on parallelism and concurrency.  In that post I offered a “top down” argument to the effect that these are different abstractions with different goals: parallelism is about exploiting computational resources to maximize efficiency, concurrency is about non-deterministic composition of components in a system.  Parallelism never introduces bugs (the semantics is identical to the sequential execution), but concurrency could be said to be the mother lode of all bugs (the semantics of a component changes drastically, without careful provision, when composed concurrently with other components).  From this point of view the two concepts aren’t comparable, yet relatively few people seem to accept the distinction, or, even if they do, do not accept the terminology.

Here I’m going to try a possible explanation of why the two concepts, which seem separable to me, may seem inseparable to others.

I think that it is to do with scheduling.

One view of parallelism is that it’s just talk for concurrency, because all you do when you’re programming in parallel is fork off some threads, and then do something with their results when they’re done.  I’ve previously argued that parallelism is about cost, but let’s leave that aside.  It’s unarguable that a parallel computation does consist of a bunch of, well, parallel computations, and so it is about concurrency.  I’ve previously argued that that’s not a good way to think about concurrency either, but let’s leave that aside as well.  So, the story goes, concurrency and parallelism are synonymous, and people like me are just creating confusion.

Perhaps that is true, but here’s why it may not be a good idea to think of parallelism this way.  Scheduling as you learned about it in OS class (for example) is a altogether different than scheduling for parallelism.  There are two aspects of OS-like scheduling that I think are relevant here.  First, it is non-deterministic, and second, it is competitive.  Non-deterministic, because you have little or no control over what runs when or for how long.  A beast like the Linux scheduler is controlled by a zillion “voodoo parameters” (a turn of phrase borrowed from my queueing theory colleague, Mor Harchol-Balter), and who the hell knows what is going to happen to your poor threads once they’re in its clutches.  Second, and more importantly, an OS-like scheduler is allocating resources competitively.  You’ve got your threads, I’ve got my threads, and we both want ours to get run as soon as possible.  We’ll even pay for the privilege (priorities) if necessary.  The scheduler, and the queueing theory behind it is designed to optimize resource usage on a competitive basis, taking account of quality of service guarantees purchased by the participants.  It does not matter whether there is one processor or one thousand processors, the schedule is unpredictable.  That’s what makes concurrent programming hard: you have to program against all possible schedules.  And that’s why it’s hard to prove much about the time or space complexity of your program when it’s implemented concurrently.

Parallel scheduling is a whole ‘nother ball of wax.  It is (usually, but not necessarily) deterministic, so that you can prove bounds on its efficiency (Brent-type theorems, as discussed in a previous post and in PFPL).  And, more importantly, it is cooperative in the sense that all threads are working together for the same computation towards the same ends.  The threads are scheduled so as to get the job (there’s only one) done as quickly and as efficiently as possible.  Deterministic schedulers for parallelism are the most common, because they are the easiest to analyze with respect to their time and space bounds.  Greedy schedulers, which guarantee to maximize use of available processors, never leaving any idle when there is work to be done, form an important class for which the simple form of Brent’s Theorem is obvious.

Many deterministic greedy scheduling algorithms are known, of which I will mention p-DFS and p-BFS, which do p-at-a-time depth- and breadth-first search of the dependency graph, and various forms of work-stealing schedulers, pioneered by Charles Leiserson at MIT.  (Incidentally, if you don’t already know what p-DFS or p-BFS are, I’ll warn you that they are a little trickier than they sound.  In particular p-DFS uses a data structure that is sort of like a stack but is not a stack.)  These differ significantly in their time bounds (for example, work stealing usually involves expectation over a random variable, whereas the depth- and breadth-first traversals do not), and differ dramatically in their space complexity.  For example, p-BFS is absolutely dreadful in its space complexity.  (For a full discussion of these issues in parallel scheduling, I recommend Dan Spoonhower’s PhD Dissertation.  His semantic profiling diagrams are amazingly beautiful and informative!)

So here’s the thing: when you’re programming in parallel, you don’t just throw some threads at some non-deterministic competitive scheduler.  Rather, you generate an implicit dependency graph that a cooperative scheduler uses to maximize efficiency, end-to-end.  At the high level you do an asymptotic cost analysis without considering platform parameters such as the number of processors or the nature of the interconnect.  At the low level the implementation has to validate that cost analysis by using clever techniques to ensure that, once the platform parameters are known, maximum use is made of the computational resources to get your job done for you as fast as possible.  Not only are there no bugs introduced by the mere fact of being scheduled in parallel, but even better, you can prove a theorem that tells you how fast your program is going to run on a real platform.  Now how cool is that?

[Update: word-smithing.]

[Update: more word-smithing for clarity and concision.]

## Old Neglected Theorems Are Still Theorems

March 20, 2014

I have very recently been thinking about the question of partiality vs totality in programming languages, a perennial topic in PL’s that every generation thinks it discovers for itself.  And this got me to remembering an old theorem that, it seems, hardly anyone knows ever existed in the first place.  What I like about the theorem is that it says something specific and technically accurate about the sizes of programs in total languages compared to those in partial languages.  The theorem provides some context for discussion that does not just amount to opinion or attitude (and attitude alway seems to abound when this topic arises).

The advantage of a total programming language such as Goedel’s T is that it ensures, by type checking, that every program terminates, and that every function is total. There is simply no way to have a well-typed program that goes into an infinite loop. This may seem appealing, until one considers that the upper bound on the time to termination can be quite large, so large that some terminating programs might just as well diverge as far as we humans are concerned. But never mind that, let us grant that it is a virtue of  T that it precludes divergence.

Why, then, bother with a language such as PCF that does not rule out divergence? After all, infinite loops are invariably bugs, so why not rule them out by type checking? (Don’t be fooled by glib arguments about useful programs, such as operating systems, that “run forever”. After all, infinite streams are programmable in the language M of inductive and coinductive types in which all functions terminate. Computing infinitely does not mean running forever, it just means “for as long as one wishes, without bound.”)  The notion does seem appealing until one actually tries to write a program in a language such as T.

Consider computing the greatest common divisor (GCD) of two natural numbers. This can be easily programmed in PCF by solving the following equations using general recursion:

$\begin{array}{rcl} \textit{gcd}(m,0) & = & m \\ \textit{gcd}(0,m) & = & m \\ \textit{gcd}(m,n) & = & \textit{gcd}(m-n,n) \quad \text{if}\ m>n \\ \textit{gcd}(m,n) & = & \textit{gcd}(m,n-m) \quad \text{if}\ m

The type of $\textit{gcd}$ defined in this manner has partial function type $(\mathbb{N}\times \mathbb{N})\rightharpoonup \mathbb{N}$, which suggests that it may not terminate for some inputs. But we may prove by induction on the sum of the pair of arguments that it is, in fact, a total function.

Now consider programming this function in T. It is, in fact, programmable using only primitive recursion, but the code to do it is rather painful (try it!). One way to see the problem is that in T the only form of looping is one that reduces a natural number by one on each recursive call; it is not (directly) possible to make a recursive call on a smaller number other than the immediate predecessor. In fact one may code up more general patterns of terminating recursion using only primitive recursion as a primitive, but if you examine the details, you will see that doing so comes at a significant price in performance and program complexity. Program complexity can be mitigated by building libraries that codify standard patterns of reasoning whose cost of development should be amortized over all programs, not just one in particular. But there is still the problem of performance. Indeed, the encoding of more general forms of recursion into primitive recursion means that, deep within the encoding, there must be “timer” that “goes down by ones” to ensure that the program terminates. The result will be that programs written with such libraries will not be nearly as fast as they ought to be.  (It is actually quite fun to derive “course of values” recursion from primitive recursion, and then to observe with horror what is actually going on, computationally, when using this derived notion.)

But, one may argue, T is simply not a serious language. A more serious total programming language would admit sophisticated patterns of control without performance penalty. Indeed, one could easily envision representing the natural numbers in binary, rather than unary, and allowing recursive calls to be made by halving to achieve logarithmic complexity. This is surely possible, as are numerous other such techniques. Could we not then have a practical language that rules out divergence?

We can, but at a cost.  One limitation of total programming languages is that they are not universal: you cannot write an interpreter for T within T (see Chapter 9 of PFPL for a proof).  More importantly, this limitation extends to any total language whatever.  If this limitation does not seem important, then consider the Blum Size Theorem (BST) (from 1967), which places a very different limitation on total languages.  Fix any total language, L, that permits writing functions on the natural numbers. Pick any blowup factor, say $2^{2^n}$, or however expansive you wish to be.  The BST states that there is a total function on the natural numbers that is programmable in L, but whose shortest program in L is larger by the given blowup factor than its shortest program in PCF!

The underlying idea of the proof is that in a total language the proof of termination of a program must be baked into the code itself, whereas in a partial language the termination proof is an external verification condition left to the programmer. Roughly speaking, there are, and always will be, programs whose termination proof is rather complicated to express, if you fix in advance the means by which it may be proved total. (In T it was primitive recursion, but one can be more ambitious, yet still get caught by the BST.)  But if you leave room for ingenuity, then programs can be short, precisely because they do not have to embed the proof of their termination in their own running code.

There are ways around the BST, of course, and I am not saying otherwise.  For example, the BST merely guarantees the existence of a bad case, so one can always argue that such a case will never arise in practice.  Could be, but I did mention the GCD in T problem for a reason: there are natural problems that are difficult to express in a language such as T.  By fixing the possible termination arguments in advance, one is tempting fate, for there are many problems, such as the Collatz Conjecture, for which the termination proof of a very simple piece of code has been an open problem for decades, and has resisted at least some serious attempts on it.  One could argue that such a function is of no practical use.  I agree, but I point out the example not to say that it is useful, but to say that it is likely that its eventual termination proof will be quite nasty, and that this will have to be reflected in the program itself if you are limited to a T-like language (rendering it, once again, useless).  For another example, there is no inherent reason why termination need be assured by means similar to that used in T.  We got around this issue in NuPRL by separating the code from the proof, using a type theory based on a partial programming language, not a total one.  The proof of termination is still required for typing in the core theory (but not in the theory with “bar types” for embracing partiality).  But it’s not baked into the code itself, affecting its run-time; it is “off to the side”, large though it may be).

Updates: word smithing, fixed bad link, corrected gcd, removed erroneous parenthetical reference to Coq, fixed LaTeX problems.

## Exceptions Are Shared Secrets

December 3, 2012

Exceptions are commonly criticized as being the “goto’s” of modern programming languages.  Raising an exception transfers control to an unknown destination, it is said, and this is a bad thing on engineering grounds.  I disagree.  It is perfectly predictable where a raised exception will be handled, provided that exceptions are done properly, which, unfortunately, is not always the case.

The crucial point that exception values are shared secrets. Let us distinguish two parties, the raiser of the exception, and the handler of it. The fundamental idea of exceptions is to transfer a value from the raiser to the handler without the possibility of interception by another party. While the language of secrecy seems appropriately evocative, I hasten to add that I am not here concerned with “attackers” or suchlike, but merely with the difficulties of ensuring modular composition of programs from components. In such a setting the “attacker” is yourself, who is not malicious, but who is fallible.

By raising an exception the raiser is “contacting” a handler with a message. The raiser wishes to limit which components of a program may intercept that message. More precisely, the raiser wishes to ensure that only certain previously agreed-upon components may handle that exception, perhaps only one. This property should remain stable under extension to the program or composition with any other component. It should not be possible for an innocent third party to accidentally intercept a message that was not intended for it.

Achieving this requires a secrecy mechanism that allows the raiser and the handler(s) to agree upon their cooperation. This is accomplished by dynamic classification, exactly as it is done properly in Standard ML (but not O’Caml). The idea is that the raiser has access to a dynamically generated constructor for exception values, and any handler has access to the corresponding dynamically generated matcher for exception values. This means that the handler, and only the handler, can decode the message sent by the raiser; no other party can do anything with it other than pass it along unexamined. It is “perfectly encrypted” and cannot be deciphered by any unintended component.

The usual exception mechanisms, as distinct from exception values, allow for “wild-card handlers”, which means that an exception can be intercepted by a third party. This means that the raiser cannot ensure that the handler actually receives the message, but it can ensure, using dynamic classification, that only a legitimate handler may decipher it. Decades of experience with Standard ML shows that this is a very useful thing indeed, and has application far beyond just the simple example considered here. For full details, see my forthcoming book, for a full discussion of dynamic classification and its role for ensuring integrity and confidentiality in a program. Dynamic classification is not just for “security”, but is rather a good tool for everyday programming.

Haskell is one language that does not get exceptions right.  Allocating an exception incurs a storage effect, and so would have to be confined to the IO monad in Haskell.  But this would destroy the utility of exceptions in pure code.  The result is an exception mechanism that is arguably broken and that does not provide the guarantees that make exceptions a perfectly pleasant and useful way to write code.

Update: Reworked last paragraph to clarify the point I am making; the previous formulation appears to have invited misinterpretation.

Update: This account of exceptions also makes clear why the perennial suggestion to put exception-raising information into types makes no sense to me. I will write more about this in a future post, but meanwhile contemplate that a computation may raise an exception that is not even in principle nameable in the type. That is, it is not conservativity that’s at issue, it’s the very idea.

Update: Wordsmithing, removal of irrelevant remarks to focus on the main point about dynamic exceptions.

## Practical Foundations for Programming Languages Is Out

December 3, 2012

Practical Foundations for Programming Languages, published by Cambridge University Press, is now available in print! It can be ordered from the usual sources, and maybe some unusual ones as well. If you order directly from Cambridge using this link, you will get a 20% discount on the cover price (pass it on).

Since going to press I have, inevitably, been informed of some (so far minor) errors that are corrected in the online edition. These corrections will make their way into the second printing. If you see something fishy-looking, compare it with the online edition first to see whether I may have already corrected the mistake. Otherwise, send your comments to me.

By the way, the cover artwork is by Scott Draves, a former student in my group, who is now a professional artist as well as a researcher at Google in NYC. Thanks, Scott!

Update: The very first author’s copy hit my desk today!

## Yet Another Reason Not To Be Lazy Or Imperative

August 26, 2012

In an earlier post I argued that, contrary to much of the literature in the area, parallelism is all about efficiency, and has little or nothing to do with concurrency.  Concurrency is concerned with controlling non-determinism, which can arise in all sorts of situations having nothing to do with parallelism.  Process calculi, for example, are best viewed as expressing substructural composition of programs, and have very little to do with parallel computing.  (See my PFPL and Rob Simmons’ forthcoming Ph.D. dissertation for more on this perspective.)  Parallelism, on the other hand, is captured by analyzing the cost of a computation whose meaning is independent of its parallel execution.  A cost semantics specifies the abstract cost of a program that is validated by a provable implementation that transfers the abstract cost to a precise concrete cost on a particular platform.  The cost of parallel execution is neatly captured by the concept of a cost graph that captures the dynamic data dependencies among subcomputations.  Details such as the number of processors or the nature of the interconnect are factored into the provable implementation, which predicts the asymptotic behavior of a program on a hardware platform based on its cost graph.  One advantage of cost semantics for parallelism is that it is easy to teach freshmen how to write parallel programs; we’ve been doing this successfully for two years now, with little fuss or bother.

This summer Guy Blelloch and I began thinking about other characterizations of the complexity of programs besides the familiar abstractions of execution time and space requirements of a computation.  One important measure, introduced by Jeff Vitter, is called I/O Complexity.  It measures the efficiency of algorithms with respect to memory traffic, a very significant determiner of performance of programs.  The model is sufficiently abstract as to encompass several different interpretations of I/O complexity.  Basically, the model assumes an unbounded main memory in which all data is ultimately stored, and considers a cache of $M=c\times B$ blocked into chunks of size $B$ that provides quick access to main memory.  The complexity of algorithms is analyzed in terms of these parameters, under the assumption that in-cache accesses are cost-free, so that the only significant costs are those incurred by loading and flushing the cache.  You may interpret the abstract concepts of main memory and cache in the standard way as a two-level hierarchy representing, say, on- and off-chip memory access, or instead as representing a disk (or other storage medium) loaded into memory for processing.  The point is that the relative costs of processing cached versus uncached data is huge, and worth considering as a measure of the efficiency of an algorithm.

As usual in the algorithms world Vitter makes use of a low-level machine model in which to express and evaluate algorithms.  Using this model Vitter obtains a lower-bound for sorting in the I/O model, and a matching upper bound using a $k$-way merge sort, where $k$ is chosen as a function of $M$ and $B$ (that is, it is not cache oblivious in the sense of Leiserson, et al.)  Although such models provide a concrete, and well-understood, basis for analyzing algorithms, we all know full well that programming at such a low-level is at best a tedious exercise.  Worse, machine models provide no foundation for composition of programs, the single most important characteristic of higher-level language models.  (Indeed, the purpose of types is to mediate composition of components; without types, you’re toast.)

The point of Guy’s and my work this summer is to adapt the I/O model to functional programs, avoiding the mess, bother, and futility of trying to work at the machine level.  You might think that it would be impossible to reason about the cache complexity of a functional program (especially if you’re under the impression that functional programming necessarily has something to do with Haskell, which it does not, though you may perhaps say that Haskell has something to do with functional programming).  Traditional algorithms work, particularly as concerns cache complexity, is extremely finicky about memory management in order to ensure that reasonable bounds are met, and you might reasonably suspect that it will ever be thus.  The point of our paper, however, is to show that the same asymptotic bounds obtained by Vitter in the I/O model may be met using purely functional programming, provided that the functional language is (a) non-lazy (of course), and (b) implemented properly (as we describe).

Specifically, we give a cost semantics for functional programs (in the paper, a fragment of ML) that takes account of the memory traffic engendered by evaluation, and a provable implementation that validates the cost semantics by describing how to implement it on a machine-like model.  The crux of the matter is to account for the cache effects that arise from maintaining a control stack during evaluation, even though the abstract semantics has no concept of a stack (it’s part of the implementation, and cannot be avoided).  The cost semantics makes explicit the reading and allocation of values in the store (using Felleisen, Morrisett, and H’s “Abstract Models of Memory Management”), and imposes enough structure on the store to capture the critical concept of locality that is required to ensure good cache (or I/O) behavior.  The abstract model is parameterized by $M$ and $B$ described above, but interpreted as representing the number of objects in the cache and the neighborhood of an object in memory (the objects that are allocated near it, and that are therefore fetched along with the object whenever the cache is loaded).

The provable implementation is given in two steps.  First, we show how to transfer the abstract cost assigned to a computation into the amount of memory traffic incurred on an abstract machine with an explicit control stack.  The key idea here is an amortization argument that allows us to obtain tight bounds on the overhead required to maintain the stack.  Second, we show how to implement the crucial read and allocate operations that underpin the abstract semantics and the abstract machine.  Here we rely on a competitive analysis, given by Sleator, et al., of the ideal cache model, and on an amortization of the cost of garbage collection in the style of Appel.  We also make use of an original (as far as I know) technique for implementing the control stack so as to avoid unnecessary interference with the data objects in cache.  The net result is that the cost semantics provides an accurate asymptotic analysis of the I/O complexity of a functional algorithm, provided that it is implemented in the manner we describe in the paper (which, in fact, is not far from standard implementation techniques, the only trick being how to manage the control stack properly).  We then use the model to derive bounds for several algorithms that are comparable to those obtained by Vitter using a low-level machine model.

The upshot of all of this is that we can reason about the I/O or cache complexity of functional algorithms, much as we can reason about the parallel complexity of functional algorithms, namely by using a cost semantics.  There is no need to drop down to a low-level machine model to get a handle on this important performance metric for your programs, provided, of course, that you’re not stuck with a lazy language (for those poor souls, there is no hope).

Follow