The Point of Laziness

April 24, 2011

As I’ve discussed previously, there are a number of good reasons why Haskell is not suitable for teaching introductory functional programming.  Chief among these is laziness, which in the context of a pure functional language has fatal side effects.  First, Haskell suffers from a paucity of types.  It is not possible in Haskell to define the type of natural numbers, nor the type of lists of natural numbers (or lists of anything else), nor any other inductive type!  (In Carollian style there are types called naturals and lists, but that’s only what they’re called, it’s not what they are.)  Second, the language has problematic cost model.  It is monumentally difficult to reason about the time, and especially space, usage of a Haskell program.  Worse, parallelism arises naturally in an eager, not a lazy, language—for example, computing every element of a finite sequence is fundamental to parallel computing, yet is not compatible with the ideology of laziness, which specifies that we should only compute those elements that are required later.

The arguments in favor of laziness never seem convincing to me.  One claim is that the equational theory of lazy programs is said to be more convenient; for example, beta reduction holds without restriction.  But this is significant only insofar as you ignore the other types in the language.  As Andrzej Filinski pointed out decades ago, whereas lazy languages have products, but not sums, eager languages have sums, but not products.  Take your pick.  Similarly, where lazy languages rely on strictness conditions, eager languages rely on totality conditions.  The costs and benefits are dual, and there seems to be no reason to insist a priori on one set of equations as being more important than the other.

Another claim is that laziness supports the definition of infinite data types, such as infinite sequences of values of some type.  But laziness is not essential, or even particularly useful, for this purpose.  For example, the type nat->nat is a natural representation of infinite sequences of natural numbers that supports many, but not all, of the operations that finite sequences (but not, for example, operations such a reverse, which make no sense in the infinite case).   More generally, there is no inherent connection between laziness and such infinitary types.  Noam Zeilberger has developed an elegant theory of eager and lazy types based on distinguishing positive from negative polarities of type constructors, the positive including the inductive and the negative including the coinductive.   Coinductive types are no more about laziness than inductive types are about pointers.

I wish to argue that laziness is important, but not for pure functional programming, but rather only in conjunction with effects.  This is the Kahn-MacQueen Principle introduced in the 1970’s by Gilles Kahn and David MacQueen in their seminal paper on recursive networks of stream transducers.  Dan Licata and I have emphasized this viewpoint in our lectures on laziness in our new course on functional programming for freshmen.

Let’s use streams as a motivating example, contrasting them with lists, with which they are confused in lazy languages.  A list is an example of a positive type, one that is defined by its membership conditions (constructors).  Defining a function on a list amounts to pattern matching, giving one case for each constructor (nil and cons), and using recursion to apply the function to the tail of the list.  A stream is an example of a negative type, one that is defined by its behavioral conditions (destructors).  Defining a stream amounts to defining how it behaves when its head and tail are computed.  The crucial thing about lists, or any positive type, is that they are colimits; we know as part of their semantics how a value of list type are constructed.  The crucial thing about streams, or any negative type, is that they are limits; we know as part of their semantics how they behave when destructed.

Since we have no access to the “inside” of a stream, we should think of it not as a static data structure, but as a dynamic process that produces, upon request, successive elements of the stream.  Internally, the stream keeps track of whatever is necessary to determine successive outputs; it has its own state that is not otherwise visible from the outside.  But if a stream is to be thought of as given by a process of generation, then it is inherently an ephemeral data structure.  Interacting with a stream changes its state; the “old” stream is lost when the “new” stream is created.  But, as we have discussed previously, ephemeral data structures are of limited utility.  The role of memoization is to transform an ephemeral process into a persistent data structure by recording the successive values produced by the process so that they can be “replayed” as necessary to permit the stream to have multiple futures.  Thus, rather than being a matter of efficiency, memoization is a matter of functionality, providing a persistent interface to an underlying ephemeral process.

To see how this works in practice, let’s review the signatures PROCESS and STREAM that Dan Licata and I developed for our class.  Here’s a snippet of the signature of processes:

signature PROCESS = sig
  type 'a process = unit -> 'a option
  val stdin : char process
  val random : real process

A process is a function that, when applied, generates a value of some type, or indicates that it is finished.  The process stdin represents the Unix standard input; the process random is a random number generator.  The signature of streams looks essentially like this:

signature STREAM = sig
  type 'a stream
  datatype 'a front = Nil | Cons of 'a * 'a stream
  val expose : 'a stream -> 'a front
  val memo : 'a Process.process -> 'a stream
  val fix : ('a stream -> 'a stream) -> 'a stream

The type ‘a front is the type of values that arise when a stream is exposed; it can either terminate, or present an element and another stream.  The memo constructor creates a persistent stream from an ephemeral process of creation for its elements.  The fix operation is used to create recursive networks of streams.  There are other operations as well, but these illustrate the essence of the abstraction.

Using these signatures as a basis, it is extremely easy to put together a package of routines for scripting.  The fundamental components are processes that generate the elements of a stream.  Combinators on streams, such as composition or mapping and reducing, are readily definable, and may be deployed to build up higher levels of abstraction.  For example, Unix utilities, such as grep, are stream transducers that take streams as inputs and produce streams as outputs.  These utilities do not perform input/output; they merely transform streams.  Moreover, since streams are persistent, there is never any issue with “buffering” or “lookahead” or “backtracking”; you just manipulate the stream like any other (persistent) data structure, and everything works automagically.  The classical Bell Labs style of intermixing I/O with processing is eliminated, leading not only to cleaner code, but also greater flexibility and re-use.  This is achieved not by the double-backflips required by the inheritance mechanisms of oopl’s, but rather by making a crisp semantic distinction between the processing of streams and the streaming of processes.  True reuse operates at the level of abstractions, not at the level of the code that gives rise to them.

Update: It seems worthwhile to point out that memoization to create a persistent from an ephemeral data structure is a premier example of a benign effect, the use of state to evince functional behavior.  But benign effects are not programmable in Haskell, because of the segregation of effects into the IO monad.

Update: Lennart Augustsson gives his reasons for liking laziness.

Update: word smithing.


Some Thoughts on Teaching FP

April 17, 2011

A number of people have asked some very good questions about the details of how we teach certain concepts in our new functional programming class for freshmen at Carnegie Mellon.  Rather than spray responses among the various comments, I’ll summarize a few major points here in hopes of helping others who may wish to teach a similar course.  So this post is not really meant for a broad audience, but rather for the specialist; feel free to skip it if it seems too focused for your interests.  I promise to write some more controversial material of more general interest soon!  Meanwhile, here are a few thoughts presented in no particular order of importance.

Because the class is intended for beginners, we start from scratch with a language-based model of computation.  This means that, with one regrettable misstep on our part, we never talk about extra-linguistic concepts like “run-time stacks” or “compilers.”  The students are taught to think in terms of the language itself, and to reason directly about both the correctness and efficiency of the code they actually write, not the code that it allegedly compiles to or translates to.  One beautiful feature of the language-based approach is that we start with a very familiar model of computation, the evaluation of polynomials over the reals.  It’s very familiar for all students, and I think they find it very satisfying precisely because it has a direct computational flavor.  You plug in for variables and simplify, and out comes the answer.  We can draw on this as our starting point; programs are just generalized polynomials.  In particular, in a functional language variables are variables: the mathematical concept of variables, which is given meaning by substitution, is precisely the programming concept of variable.  It’s not analogous, it’s the same.  So we can draw on their experience and ask them to prove things about programs using methods that build directly on what they already know.  It’s extremely natural, and very beautiful, and leads easily to an integrated view of mathematics and programming.  Moreover, it’s a level playing field.  Students with prior “programming experience” are, if anything, at a disadvantage, because they think they know things that are either wrong or inapplicable.  One consequence is gender equity, because even with American society being what it is, the women have no particular disadvantage with respect to the men when it comes to our style of thinking and programming.  It’s a win-win-win situation.

Moving to a more technical level, the use of structural operational semantics is indispensable for providing a rigorous foundation for understanding program execution, reasoning about program correctness, and for defining a cost model to support reasoning about asymptotic complexity.  There is no substitute for this!  Without a crisp formulation of the semantics of the language, it is impossible to discuss any of these issues in a meaningful and technically precise way.  With it you can readily resolve any questions about “what happens if …”, giving the students a tool that they can use themselves to answer such questions.  Moreover, as program verification becomes more important in industrial practice, as well as academic research, it is essential that students be educated in the tools of semantics.  Structural operational semantics is very easy to teach, and presents no problems for the students.  We just use it, and they get it without any fuss or bother.  It is a natural extension of their experience with high school algebra.  Be not afraid of using these tools to teach programming!

As I’ve explained previously, it is a very good idea to avoid Booleans as much as possible.  And, above all, don’t mention equality!  The equals sign in programming languages is not the equals sign of mathematics.  Propositions are not Booleans, and it only confuses matters to use notations that encourage this misconception.  Related to this, avoid if-then-else entirely, and instead use only case analysis for branching, even when the value to be discriminated is a Boolean.  We consistently write things like

case,y) of
  LESS => ...
| GREATER => ...
| EQUAL => ...

rather than a nested conditional branch.  It encourages students to think in terms of pattern matching, and prepares the ground for later developments, including a smooth transition to pattern matching over more complex data structures and reasoning inductively when programming recursively.

Teaching parallelism is completely straightforward, because the model of computation inherently avoids unnatural and unnecessary problems of interference, and focuses attention on the critical issue of data dependencies among computations in a program.  Students have no trouble computing the work (sequential time complexity) or span (parallel time complexity) of a program, and have no problems reading off recurrences for the respective time complexities.  Later, when we introduce sequences, the idea of computing in parallel with the entire sequence, rather than item-by-iterm (as encouraged by the dreadful iterators so beloved in the oo world), comes naturally and easily.  The key to this, of course, is that data structures in a functional language are naturally persistent; it is monstrously hard to use ephemeral data structures in a parallel computation, and is not something we could hope to teach freshmen.

A major decision for us is how to teach the expression and enforcement of abstraction in a program.  In a departure from our previous approach, we have decided against using opaque ascription (sealing) as a means of enforcing abstraction.  It has its virtues, but the problem is that it does not mesh well with other language features, in particular with substructures and type classes (views).  For example, consider the signature of a mapping whose domain is an ordered type of keys:

signature MAPPING = sig
  structure Key : ORDERED
  type 'a mapping
  val lookup : Key.t -> 'a mapping -> 'a

Unfortunately, sealing a structure with this signature renders the module useless:

structure IntMapping :> MAPPING = struct
 structure Key = Int
 type 'a mapping = 'a bst

The trouble is that not only is the type ‘a IntMapping.mapping abstract, as intended, but so is IntMapping.Key.t, which is not at all intended!  To get around this we we must create a specialization of the signature MAPPING using one of several means such as

signature INT_MAPPING = MAPPING where type Key.t=int
structure IntMapping :> INT_MAPPING = ...

Of course one need not name the signature, but this illustrates the general problem.  As things get more complicated, you have more and more clauses that specify the types of things (sharing specifications).

The alternative, which has worked very well for us, is to eschew opaque ascription, and instead rely on the datatype mechanism to make types abstract.  So to give an implementation of the abstract type of mappings with keys being integers, we proceed as follows:

structure IntMapping : MAPPING = struct
  structure Key : ORDERED = Int
  datatype 'a bst = Empty | Node of 'a bst * (Key.t * 'a) * 'a bst
  type 'a mapping = 'a bst
  val insert = ...

The point is that since the constructors of the type ‘a bst are not exported in the interface, the type ‘a IntMapping.mapping is abstract.  Note as well that the use of transparent ascription on the structure Key ensures that keys really are integers (of type, and are not abstract, exactly as intended.  This formulation allows us to state simple rules of signature matching (every specification in the signature has a corresponding declaration in the structure), and allows us to enforce abstraction boundaries with a minimum of fuss.  The students have had absolutely no trouble with this at all, and we have had no trouble structuring our code this way.

When using functors (parameterized modules) to combine modules it is, of course, necessary to impose sharing constraints to ensure that only coherent compositions are possible.  (Rather than take the space to explain this here, I will rely on the reader’s experience to understand what is involved here.)  These sorts of sharing specifications are perfectly natural, easily explained, and have presented absolutely no difficulties for the students.  We illustrated their use in our game tree search example, in which the “referee” module is parameterized by the two “player” modules, which must of course cohere on their concept of a “game” (it’s no use pitting a chess player against a checkers player!).  The code looks like this

functor Referee
  (structure Player1 : PLAYER and Player2 : PLAYER
   sharing type Player1.Game.t = Player2.Game.t) : REFEREE = ...

The sharing specification states precisely and concisely the natural coherence constraint governing the two players.  Here again, the dog we feared never barked, and the students found it all quite intuitive and unproblematic.  This allowed them to expend their time on the actual complexities of the problem at hand, such as how to think about alpha-beta pruning in a parallel game-tree search, rather than get all tied up with the bureaucracy of structuring the code itself.

The virtue of teaching bright young students is that they are bright and they are young.  Their brilliance is, of course, a pleasure.  We have to work hard to come up with sufficiently challenging exercises, and many students challenge us with their solutions to our problems.  Their youth means that they come to us with a minimum of misconceptions and misinformation that they’ve picked up on the street, and are open to learning methods that are entirely non-standard (at least for now) with respect to what their friends are learning at other universities.  What makes Carnegie Mellon a special place is precisely that the students are pushed into thinking hard in ways that they might not be.  Personally, I hope that more universities worldwide build on what we have started, and give their students the same benefits that ours are already enjoying.

Modules Matter Most

April 16, 2011

When it comes to controlling the complexity of developing and, more importantly, maintaining a large system, the only game in town is modularity.  And as even the strongest proponents of unityped languages have come to accept, modularity is all about types (static types, of course, there being no other kind).  A decomposition of a system into modules consists of an application of the structural principle of substitution (transitivity of entailment, composition of maps) that is fundamental to the very conception of a type system:

\displaystyle{{\Gamma\vdash M : A \qquad \Gamma,x:A \vdash N:B}\over{\Gamma\vdash [M/x]N:B}}

In pedestrian terms the type A is the “header file” describing M, and N is the new code implementing specification B in terms of an unspecified implementation of the functionality specified by A.  The interaction between M and N is mediated entirely by the type A; access to the source code of M is denied to the client, precisely so that the dependence between the components is weakened in anticipation of future changes to M, or to allow for flexibility in development (M need not even exist for the development of N to proceed, as long as the type A is specified).

To be most useful, it is important that the relationship between M and A be many-to-many, and not many-to-one or one-to-one.  Think of A as the API for a mapping between keys and values.  It is absolutely essential that the language admit that many different modules M be of type A, and it is absolutely essential that a given module M satisfy many distinct types A, without prior arrangement.  The type A is purely descriptive of the extensional behavior of a module of that type, and cannot be regarded as pinning down any particular implementation of that behavior.  Moreover, a given module may well support a variety of views, for example by “forgetting” certain aspects of it in particular contexts.  For example, one may neglect that an implementation of mapping supports deletion in a setting where only extension and application are required.

This is all pretty basic, but what surprises me is how few languages support it cleanly and simply.  One particularly awkward method is to rely on extra-linguistic “tools” that manipulate build parameters to switch choices of M for a given N, quickly resulting in an ad hoc language of its own just to manage the build scripts.  Another technique is to manipulate or restrict inheritance so that some degree of modularity can be simulated, much as one can bang in nails using a pair of pliers.  A common methodology, in fact, attempts to cut down inheritance to provide what we had in ML in the early 1980’s (functors), obviating the intervening decades of maladaptations of bad ideas.

More disappointingly, for me at least, is that even relatively enlightened languages, such as Haskell or F#, fail to support this basic machinery.  In Haskell you have type classes, which are unaccountably popular (perhaps because it’s the first thing many people learn).  There are two fundamental problems with type classes.  The first is that they insist that a type can implement a type class in exactly one way.  For example, according to the philosophy of type classes, the integers can be ordered in precisely one way (the usual ordering), but obviously there are many orderings (say, by divisibility) of interest.  The second is that they confound two separate issues: specifying how a type implements a type class and specifying when such a specification should be used during type inference.  As a consequence, using type classes is, in Greg Morrisett’s term, like steering the Queen Mary: you have to get this hulking mass pointed in the right direction so that the inference mechanism resolves things the way you want it to.  In F# the designers started with the right thing (Caml) and eliminated the very thing that matters the most about ML, it’s module system!  Instead the F# designers added a bunch of object-oriented concepts (for the sake of compatibility with .net and with the mindset of MS developers), and tricked up the language with features that are more readily, and flexibly, provided by the module system.

Why bring this up?  Apart from general grousing, my point is that we had little choice in what language to use in teaching our students.  Modularity matters most, and we must have a language that supports flexible modularity in the form I am describing here.  When we examined our options, which we did very carefully, the only contenders are Standard ML and O’Caml.  We could have gone with either, but were persuaded to use Standard ML, which has worked beautifully for our purposes.  The decisive factor in choosing between the two ML’s was simply that we have a prior code base in Standard ML on which to draw, and there are two implementations of Standard ML that support parallelism (MLton and Poly/ML), albeit neither optimally for our purposes.  Haskell provides better support for parallelism (by undoing its unfortunate commitment to laziness, which results in an unusable cost model for both time and, especially, space), but wasn’t suitable because of its lack of support for modularity.

As I have mentioned, our plan is to re-do the introductory curriculum in Computer Science to modernize it and to place better emphasis on principles rather than current technologies.  One aspect of this is to re-do the standard Data Structures and Algorithms course to eliminate the over-emphasis on ephemeral data structures, and to treat parallel algorithms as the general case that encompasses the increasingly irrelevant case of one processor.  (Yes, this is a technological trend, but it is more importantly a conceptual change that emerges from focusing on language, rather than machine, models of computation.)  What is a data structure?  It’s a signature, or interface, written in the language you’re programming in.  What’s an algorithm?  It’s a structure, or implementation, of that signature.  A signature, such as that for a mapping, can be implemented in various ways, with different cost trade-offs (logarithmic lookup vs. constant lookup, for example).  A given algorithm, such as a balanced tree, can implement many different data structures, such as mappings or sets.  The distinction between a peristent and an ephemeral mapping is a difference of data structure, that is, of signature.  The demands are different, the algorithms are different.  We should be able to support both forms as easily and cleanly as the other, to be able to compare them, and to explain, for example, why the ephemeral case is of limited utility.  It is not too much to ask to be able to write these examples as running code, with a minimum of fuss or bother!

We have for decades struggled with using object-oriented languages, such as Java or C++, to explain these simple ideas, and have consistently failed.  And I can tell those of you who are not plugged into academics at the moment, many of my colleagues world-wide are in the same situation, and are desperate to find a way out.  The awkward methodology, the “design patterns”, the “style guidelines”, all get in the way of teaching the principles.  And even setting that aside, you’re still doing imperative programming on ephemeral data structures.  It just does not work, because it is fundamentally the wrong thing.  Just try to teach, say, binary search tree delete; it’s a horrific mess!  You wind up with absurd “null pointer” nonsense, and a complex mess caused by the methodology, not the problem itself.  Pretty soon you have to resort to “frameworks” and “tools” just to give the students a fighting chance to get anything done at all, distancing them from the essential ideas and giving the impression that programming is painful and ugly, an enormous tragedy.

Our solution is to abandon it entirely, pushing OO techniques to a more advanced level for students who wish to learn them, and concentrating on the basic principles of modularity, parallel and sequential cost analysis, and direct verification of running code at the introductory level.  Dijkstra used to say “beauty is our business”, to which I would add that life is too short, and bright minds too precious, to waste on ugly things.  And if you take this point of view seriously, the best and the brightest are drawn to, rather than repulsed by, the field.  Pace some of my colleagues at a Major Institute of Technology, students absolutely do love to program and do love the beauty of code, provided that the code you ask them to write is, in fact, beautiful.  There is little more dreary than the corporate bureaucracy of OOP, and few things more lovely than the mathematical elegance of FP.

[Update: word-smithing, again]

A Dead Dog

April 12, 2011

In an earlier post I wrote about our approach to teaching students to reason inductively when programming recursively.  This generated some discussion, because teaching induction and recursion is notoriously difficult, to the point that many of my colleagues have all but given up hope of ever getting these ideas across to the majority of their students.  I’ve already explained our methods, and suggested some reasons why we seem to have succeeded where others have not.

We’re nearing the end of semester, and I must say that I am so proud of our students, I just have to brag a bit.  First, some background.  As I’ve mentioned, I am co-teaching a new course in functional programming with Dan Licata this semester as part of a thorough revamping of introductory CS that places a strong emphasis on reasoning about the correctness and efficiency of programs in both a sequential and parallel setting.  Dan and I have worked closely with an amazing team of teaching assistants in developing the new curriculum and course materials.  For the sake of continuity, Dan does all the lecturing, and I sit in the audience, sometimes kibbitzing a bit, but mostly just paying attention to the signals that students are sending and thinking about what we could do better next time.  Despite it being a moderately large class of about 83 students, we have managed to maintain an interactive atmosphere in which students freely ask, and respond, to questions, sometimes working in pairs for a few minutes on their own during the lecture.  This has worked very well for everyone, creating a conducive environment for learning to write beautiful code.

Today’s lecture was about persistent and ephemeral data structures.  One goal was simply to draw the distinction, and show how it is expressed in the interface of an abstract type; it is good to have some terminology available for comparing functional and imperative programming styles.  Another goal was to point out the importance of persistence for parallelism, since a parallel computation inherently requires that a data structure have “multiple futures”, rather than the “single future” provided by the ephemeral case.  (Put another way, the exclusive consideration of ephemeral representations in the usual data structures course taught worldwide looks more and more misguided as time goes on.  Indeed, our planned new approach to teaching data structures reverses the emphasis.  This is one reason why I consider object-oriented languages to be unsuitable for a modern introductory curriculum, conventional wisdom notwithstanding.)  A third goal was to re-acquaint the students with imperative programming so that they could make direct comparison with what they have been learning all semester, and recognize the limitations of imperative methods when it comes to parallelism and distribution.  Simply put, imperative methods are useful, if at all, in the “corner case” of sequential execution of programs acting on ephemeral data structures; in all other cases what you want are functional (transformational) methods.

But enough about context.  What made me feel so proud today was the sophistication displayed by our students during today’s interactive program development.  The exercise was to implement binary search tree delete in the persistent (transformational) style, and in the ephemeral (imperative) style in two different ways.  Dan set up the problem, and asked the students to help him write the code.  What impressed me so deeply was that as I listened to the murmur in the classroom, and to the proposals offered aloud, I could hear students saying things like

Well, inductively, we can assume that the problem has been solved for the left subtree, so to continue we need only ….

Outstanding!  (And remember, these are freshmen!)  Just as I had hoped, for these students thinking inductively comes naturally as the obvious way to think about a piece of recursive code!  They know instinctively how to formulate an inductive argument for the correctness of program, rather than resort to error-prone hand-waving about “going around and around the loop” that is all too typical at this level.  What makes this possible is that functional programs are inherently mathematical, allowing the students to concentrate on the ideas, rather than on the intricacies of coding in less expressive imperative languages.  I think that this gives them the confidence to tackle some quite difficult problems, such as the Barnes-Hut n-body algorithm or Jamboree game-search algorithm, for a one-week homework assignment.  Material that used to be complicated, or deferred to more advanced courses, can be done readily with beginners if you have the right framework in which to express the ideas.

As I walked back to my office, a group of students on the elevator with me groaned about the pain of having to do imperative programming again (most had learned the old ways last semester), but cheered me up by saying that doing so made a good case for why functional methods are better.  What more could one ask?

Persistence of Memory

April 9, 2011

Salvador Dali’s masterpiece, The Persistence of Memory, is one of the most recognizable works of art in the world.  The critic Dawn Ades described it as “a Surrealist meditation on the collapse of our notions of a fixed cosmic order” induced by Einstein’s penetrating analysis of the concepts of time and space in the physical world.  Just as Dali’s Persistence of Memory demarcated the transition from age-old conceptions of time and space in physics, so does the computational concept of persistence of memory mark the transition from sequential time and mutable space to parallel time and immutable space.

A short while ago I described the distinction between parallelism and concurrency in terms of a cost model that assigns a parallel, as well as sequential, time complexity to a program.  Parallelism is all about efficiency, not semantics; the meaning of a program is not affected by whether it is executed on one processor or many.  Functional languages expose parallelism by limiting sequential dependencies among the parts of a computation; imperative languages introduce inessential dependencies that impede parallelism.

Another critical ingredient for parallelism is the concept of a persistent data structure, one for which its associated operations transform, rather than alter, it.  A persistent dictionary, for example, has the characteristic that inserting an element results in a new dictionary with an expanded domain; the old dictionary persists, and is still available for further transformation.  When calculating the sum of 2 and 2, resulting in 4, no one imagines that the 2’s are “used up” in the process!  Nor does one worry whether the sum of 1 and 3 is the “same” 4 or a “different” 4!  The very question is absurd (or, more precisely, trivial).  So why do we worry about whether the result of inserting 2 into dict “uses up” the old dict?  And why do we worry about whether inserting 2 into the empty dictionary twice results in the “same” dictionary or a “different” one?

Yet both academic and practical computing has all but confined itself to ephemeral data structures, which exhibit precisely such behavior.  Operations on an ephemeral data structure “use up” the data structure, making it unavailable for further use without going to some trouble to get it back.  The pathologies resulting from this abound.  Standard compiler texts, for example, devote a chapter to concepts like “block structured symbol tables”, which are, in fact, nothing more than persistent dictionaries done the hard way.  More generally, whenever a data structure has multiple futures, such as when backtracking or exploiting parallelism, ephemeral data structures get in the way.  Indeed,  the bulk of object-oriented programming, with its absurd over-emphasis on the “message passing” metaphor, stress the alteration of objects as the central organizing principle, confounding parallelism and complicating simple algorithms.

A prime virtue of functional languages is that persistence is the default case, but they can as readily support ephemeral data structures as any imperative (including object-oriented) language.  All functional languages include types of mutable cells and mutable arrays, and provide support for conventional, sequential, imperative programming with semicolons and even curly braces!  (Some do this better than others; Haskell is, in my view, the world’s best imperative programming language, and second-best functional language, but that’s a subject for another post.)  But why would you want to? Why deprive yourself of the benefits of persistence, and insist instead on an ephemeral data structure?

This question came up recently in our planning for the Functional Programming class that we are teaching this semester for freshmen at Carnegie Mellon.  All semester we have been using functional programming techniques to build clean, verifiable, modular, parallel programs.  The students routinely prove theorems about their code, structure programs using abstract types, and exploit parallelism to improve the asymptotic performance of their programs.  Recent homework assignments include the implementation of the parallel Barnes-Hut algorithm for the n-body problem in physics, and the parallel Jamboree algorithm for game-tree search in perfect information games.  Persistent data structures are the key to making this possible; just try to code Barnes-Hut in an imperative language, and you will find yourself in a morass worrying about concurrency when you should instead be envisioning a recursive tree decomposition of space, and the computation of forces using formulas from high school physics.

We tried hard to find good motivations for using an ephemeral data structure when you can just as easily (actually, much more easily) use a persistent one.  As we went through them, we realized that all of the standard arguments are questionable or false.  The usual one is some vague notion of “efficiency” in either time or space.  While I concede that one can, in principle, solve a particular, confined problem more efficiently by doing absolutely everything by hand (memory management, scheduling, arithmetic), in the overwhelming majority of cases the demands of evolution of code far outweigh the potential advantages of doing everything by hand.  Modularity matters most when it comes to building and evolving large systems; functional languages, with persistent data structures, support modularity the best.  (I’ll have more to say about this in a future post.)

Most of the arguments about efficiency, though, ignore questions of functionality.  It is senseless to compare the “efficiency” of one data structure that provides different functionality than another.  A persistent data structure does more for you than does an ephemeral one.  It allows you to have multiple futures, including those that evolve in parallel with one another.  It makes no sense to insist that some ephemeral approximation of such a data structure is “more efficient” if it does not provide those capabilities!  And making it do so is, invariably, a bitch!  Conventional ephemeral data structures are not readily parallelizable; it’s often a publishable result to get a decent degree of parallelism using imperative methods.  By contrast, even freshmen (admittedly, Carnegie Mellon freshmen) can implement a parallel game tree search or a tree-based solution to the n-body problem in a one-week homework assignment.

So what’s the deal?  Why should we care about ephemeral data structures at all?  I have no idea.  Mostly, it’s a legacy cost imposed on us by the overzealous emphasis on imperative methods and machine-based, rather than language-based, models of computation.  But this will change.  Starting this fall, introductory data structures and algorithms will be liberated from the limitations of imperative, object-oriented programming, and will instead stress persistent (as well as ephemeral) data structures, and parallel (including as a special case sequential) algorithms.  The future of computing depends on parallelism (for efficiency), distribution (for scale), and verification (for quality).  Only functional languages support all three naturally and conveniently; the old ones just don’t cut it.

Update: Here’s a chart summarizing the situation as I see it:

\displaystyle  \begin{array}{|c|c|c|}  \hline  & \text{Ephemeral} & \text{Persistent} \\  \hline  \text{Sequential} & \textit{imperative} & \textit{benign effects} \\  \hline  \text{Parallel} & \textit{hard to get right} & \textit{functional} \\  \hline  \end{array}

Conventional imperative programming works well for the ephemeral, sequential case; it is notoriously hard to use imperative methods for parallelism.  Benign effects, as exemplified by self-adjusting data structures, can be used to give rise to persistent behavior in the sequential setting, but the use of effects impedes parallelism.

Functions Are Values

April 2, 2011

After the midterm the course staff for the Functional Programming class had a T-shirt made up saying “Functions Are Values”.  What prompted this is a curious blind spot shared by nearly every student in the class about the nature of functions in ML (or, for that matter, in math).  The students performed admirably on the midterm examination, capably writing higher-order functions, proving correctness of programs by structural induction, and even using higher-order functions to express staged computation.  And yet nearly every student got the following simple “gimme” question wrong:

What is the type and value of the expression “fn x => raise Fail”?

I sensed trouble when, during the examination, a student asked me to clarify the question for him.  Needless to say, I was at a loss for words!  Sure enough, all but one student got this simple question wrong, sometimes spectacularly.

Many said “it has no value” and offered a guess as to its type.  Others said “it raises an exception” and therefore has type “exn”.  (For those who don’t know ML, the type exn is the type of values associated with an exception.)  Still others said “the compiler will notice that it always raises an exception, and will therefore reject it”, or words to that effect.  Where the hell did they get that bizarre notion?  Whatever the deficiencies of our teaching may be, we certainly never said anything close to that!  Others got the type right, but still could not explain what is its value.

Given that they are clearly capable of writing higher-order functional programs, how could this happen?  Obviously the fault is not with our students, but with ourselves.  But what did we do wrong?  How did we manage to stump so many students with what we thought was a free 3-pointer?  To be honest, I don’t know.  But I have a few theories that I thought I would air here in hopes of helping others avoid my mistakes, or nip misunderstandings in the bud.

Throughout the course we have been very careful to develop a rigorous language-based model of computation using structural operational semantics.  We assign meaning to the program you write, not to its translation into some mysterious underlying machine that doesn’t really exist anyway (in other words, we avoided telling the conventional lies).  Nowhere did we ever explain anything by reference to “the compiler”, except at one point, which I now realize was a cardinal mistake.  Don’t make it yourself.  Here’s how it happened.  We meant well, but the spirit is weak, and sometimes we stray.  Forgive me Father, for I have sinned…..

Here’s where we went wrong, and I think invited wild speculation that led to the mistakes on the examination.  One cool use of higher-order functions is to stage computation.  A good example is provided by a continuation-based regular expression matcher, which I will not explain in any detail here.  The crucial point is that it has the type

regexp -> string -> (string -> bool) -> bool

which says that it accepts a regular expression, a string, and a continuation, matching an initial segment of the string, if possible, and passing the rest to the continuation, or returning false if not.  As this description suggests we can think of the matcher as a curried form of a three-argument function, and that’s that.  But a more interesting formulation stages the computation so that, given the regexp, it computes a matcher for that regular expression that may be applied repeatedly to many different candidate strings without reprocessing the regular expression.  (The code for this is quite beautiful; it’s available on our course web site if you’re interested.)

All this can be explained quite neatly using operational semantics, showing how the recursion over the regexp is completed, yielding a matching function as result.  It’s all very cool, except for one little thing: the result contains numerous beta redices that can be simplified to obtain a clearer and simpler representation of the matching code.  Since the equations involved are all so self-evident, we stated (and here’s our mistake) that “the compiler can simplify these away to obtain the following code”, which was, of course, much clearer.  What we said was perfectly true, but unfortunately it opened the floodgates of incomprehension.  The trouble is, the students have no idea what a compiler is or how it works, so for them it’s a “magic wand” that somehow manages to get their ML code executed on an x86.  And we invited them to speculate on what this mystical compiler does, and thereby (I conjecture) invited them to think that (a) what the compiler does is paramount for understanding anything, and (b) the compiler does whatever they wish to imagine it does.  Somehow the students read our simple three-pointer as a “trick question” that was supposed to involve some mumbo-jumbo about the compiler, and that’s what we got, alas.

So, my first bit of advice is, don’t mention the compiler as an explanation for anything!  They’ll have plenty of chances later in their careers to learn about compilers, and to understand how semantics informs compilation, what is valid equational reasoning, and so forth.  But for freshmen, stick with the model.  Otherwise you’re courting disaster.  (I’ve already mentioned other situations where mentioning “the compiler” only muddies the waters and confuses students, the teaching of recursion being a prime example.  I intend to discuss others in future posts.)

Now that I’ve taken the blame for my mistakes, I feel less guilty about shifting some of it elsewhere.  I have two thoughts about why students resist the idea of functions being values of no fundamentally different character than any others.  One source, no doubt, is that many of them have “learned” programming on the street, and have developed all sorts of misconceptions that are being applied here.  Most popular languages make it hard to handle functions as values: you have to name them, you have to go to special trouble to pass them as arguments or return them as results, they are called “methods” and have a strange semantics involving state, and on and on.  From this students acquire the sense that functions are “special”, and cannot be thought of like other things.

Another source of misunderstanding is that elementary mathematics (up through freshman year, at least) stresses the pointful style, always speaking of f(x), rather than f itself, and carrying this forward through calculus, writing d f(x) / dx for the derivative, confusing the function with its values, and so forth.  Here again the message is clear: functions are “special” and are not to be treated on the same footing as numbers.  It’s a pity, because I think that the point-free style is clearer and more natural, if non-standard.  The differential is an operator acting on functions that produces a linear approximation to the function at each point in its domain (if the function in fact has such).  Be that as it may, the consequence of the pointful style is that students develop early on a “fear of functions” that is hard to break.