Parallelism and Concurrency, Revisited

April 9, 2014

I still get compliments on and criticisms of my post from three years ago (can it possibly be that long?) on parallelism and concurrency.  In that post I offered a “top down” argument to the effect that these are different abstractions with different goals: parallelism is about exploiting computational resources to maximize efficiency, concurrency is about non-deterministic composition of components in a system.  Parallelism never introduces bugs (the semantics is identical to the sequential execution), but concurrency could be said to be the mother lode of all bugs (the semantics of a component changes drastically, without careful provision, when composed concurrently with other components).  From this point of view the two concepts aren’t comparable, yet relatively few people seem to accept the distinction, or, even if they do, do not accept the terminology.

Here I’m going to try a possible explanation of why the two concepts, which seem separable to me, may seem inseparable to others.

I think that it is to do with scheduling.

One view of parallelism is that it’s just talk for concurrency, because all you do when you’re programming in parallel is fork off some threads, and then do something with their results when they’re done.  I’ve previously argued that parallelism is about cost, but let’s leave that aside.  It’s unarguable that a parallel computation does consist of a bunch of, well, parallel computations, and so it is about concurrency.  I’ve previously argued that that’s not a good way to think about concurrency either, but let’s leave that aside as well.  So, the story goes, concurrency and parallelism are synonymous, and people like me are just creating confusion.

Perhaps that is true, but here’s why it may not be a good idea to think of parallelism this way.  Scheduling as you learned about it in OS class (for example) is a altogether different than scheduling for parallelism.  There are two aspects of OS-like scheduling that I think are relevant here.  First, it is non-deterministic, and second, it is competitive.  Non-deterministic, because you have little or no control over what runs when or for how long.  A beast like the Linux scheduler is controlled by a zillion “voodoo parameters” (a turn of phrase borrowed from my queueing theory colleague, Mor Harchol-Balter), and who the hell knows what is going to happen to your poor threads once they’re in its clutches.  Second, and more importantly, an OS-like scheduler is allocating resources competitively.  You’ve got your threads, I’ve got my threads, and we both want ours to get run as soon as possible.  We’ll even pay for the privilege (priorities) if necessary.  The scheduler, and the queueing theory behind it is designed to optimize resource usage on a competitive basis, taking account of quality of service guarantees purchased by the participants.  It does not matter whether there is one processor or one thousand processors, the schedule is unpredictable.  That’s what makes concurrent programming hard: you have to program against all possible schedules.  And that’s why it’s hard to prove much about the time or space complexity of your program when it’s implemented concurrently.

Parallel scheduling is a whole ‘nother ball of wax.  It is (usually, but not necessarily) deterministic, so that you can prove bounds on its efficiency (Brent-type theorems, as discussed in a previous post and in PFPL).  And, more importantly, it is cooperative in the sense that all threads are working together for the same computation towards the same ends.  The threads are scheduled so as to get the job (there’s only one) done as quickly and as efficiently as possible.  Deterministic schedulers for parallelism are the most common, because they are the easiest to analyze with respect to their time and space bounds.  Greedy schedulers, which guarantee to maximize use of available processors, never leaving any idle when there is work to be done, form an important class for which the simple form of Brent’s Theorem is obvious.

Many deterministic greedy scheduling algorithms are known, of which I will mention p-DFS and p-BFS, which do p-at-a-time depth- and breadth-first search of the dependency graph, and various forms of work-stealing schedulers, pioneered by Charles Leiserson at MIT.  (Incidentally, if you don’t already know what p-DFS or p-BFS are, I’ll warn you that they are a little trickier than they sound.  In particular p-DFS uses a data structure that is sort of like a stack but is not a stack.)  These differ significantly in their time bounds (for example, work stealing usually involves expectation over a random variable, whereas the depth- and breadth-first traversals do not), and differ dramatically in their space complexity.  For example, p-BFS is absolutely dreadful in its space complexity.  (For a full discussion of these issues in parallel scheduling, I recommend Dan Spoonhower’s PhD Dissertation.  His semantic profiling diagrams are amazingly beautiful and informative!)

So here’s the thing: when you’re programming in parallel, you don’t just throw some threads at some non-deterministic competitive scheduler.  Rather, you generate an implicit dependency graph that a cooperative scheduler uses to maximize efficiency, end-to-end.  At the high level you do an asymptotic cost analysis without considering platform parameters such as the number of processors or the nature of the interconnect.  At the low level the implementation has to validate that cost analysis by using clever techniques to ensure that, once the platform parameters are known, maximum use is made of the computational resources to get your job done for you as fast as possible.  Not only are there no bugs introduced by the mere fact of being scheduled in parallel, but even better, you can prove a theorem that tells you how fast your program is going to run on a real platform.  Now how cool is that?

[Update: word-smithing.]

[Update: more word-smithing for clarity and concision.]


Intro Curriculum Update

August 17, 2012

In previous posts I have talked about the new introductory CS curriculum under development at Carnegie Mellon. After a year or so of planning, we began to roll out the new curriculum in the Spring of 2011, and have by now completed the transition. As mentioned previously, the main purpose is to bring the introductory sequence up to date, with particular emphasis on introducing parallelism and verification. A secondary purpose was to restore the focus on computing fundamentals, and correct the drift towards complex application frameworks that offer the students little sense of what is really going on. (The poster child was a star student who admitted that, although she had built a web crawler the previous semester, she in fact has no idea how to build a web crawler.) A particular problem is that what should have been a grounding in the fundamentals of algorithms and data structures turned into an exercise in object-oriented programming, swamping the core content with piles of methodology of dubious value to beginning students. (There is a new, separate, upper-division course on oo methodology for students interested in this topic.) A third purpose was to level the playing field, so that students who had learned about programming on the street were equally as challenged, if not more so, than students without much or any such experience. One consequence would be to reduce the concomitant bias against women entering CS, many fewer of whom having prior computing experience than the men.

The solution was a complete do-over, jettisoning the traditional course completely, and starting from scratch. The most important decision was to emphasize functional programming right from the start, and to build on this foundation for teaching data structures and algorithms. Not only does FP provide a much more natural starting point for teaching programming, it is infinitely more amenable to rigorous verification, and provides a natural model for parallel computation. Every student comes to university knowing some algebra, and they are therefore familiar with the idea of computing by calculation (after all, the word algebra derives from the Arabic al jabr, meaning system of calculation). Functional programming is a generalization of algebra, with a richer variety of data structures and a richer set of primitives, so we can build on that foundation. It is critically important that variables in FP are, in fact, mathematical variables, and not some distorted facsimile thereof, so all of their mathematical intuitions are directly applicable. So we can immediately begin discussing verification as a natural part of programming, using principles such as mathematical induction and equational reasoning to guide their thinking. Moreover, there are natural concepts of sequential time complexity, given by the number of steps required to calculate an answer, and parallel time complexity, given by the data dependencies in a computation (often made manifest by the occurrences of variables). These central concepts are introduced in the first week, and amplified throughout the semester.

Two major homework exercises embody the high points of the first-semester course, one to do with co-development of code with proof, the other to do with parallelism. Co-development of program and proof is illustrated by an online regular expression matcher. The problem is a gem for several reasons. One is that it is essentially impossible for anyone to solve by debugging a blank screen. This sets us up nicely for explaining the importance of specification and proof as part of the development process. Another is that it is very easy, almost inevitable, for students to make mistakes that are not easily caught or diagnosed by testing. We require the students to carry out a proof of the correctness of the matcher, and in the process discover a point where the proof breaks down, which then leads to a proper solution. (See my JFP paper “Proof-Directed Debugging” for a detailed development of the main ideas.) The matcher also provides a very nice lesson in the use of higher-order functions to capture patterns of control, resulting in an extremely clean and simple matcher whose correctness proof is highly non-trivial.

The main parallelism example is the Barnes-Hut algorithm for solving the n-body problem in physics. Barnes-Hut is an example of a “tree-based” method, invented by Andrew Appel, for solving this well-known problem. At a high level the main idea is to decompose space into octants (or quadrants if you’re working in the plane), recursively solving the problem for each octant and then combining the solutions to make an overall solution. The key idea is to use an approximation for bodies that are far enough away—a distant constellation can be regarded as an atomic body for the purposes of calculating the effects of its stars on the sun, say. The problem is naturally parallelizable, because of the recursive decomposition. Moreover, it provides a very nice link to their high school physics. Since FP is just an extension of mathematics, the equations specifying the force law and Newton’s Law carry over directly to the code. This is an important sense in which FP builds on and amplifies their prior mathematical experience, and shows how one may connect computer science with other sciences in a rather direct manner.

The introductory FP course establishes the basis for the new data structures and algorithms course that most students take in either the third or fourth semester. This course represents a radical departure from tradition in several ways. First, it is a highly rigorous course in algorithms that rivals the upper-division course taught at most universities (including our own) for the depth and breadth of ideas it develops. Second, it takes the stance that all algorithms are parallel algorithms, with sequential being but a special case of parallel. Of course some algorithms have a better parallelizability ratio (a precise technical characterization of the ability to make use of parallelism), and this can be greatly affected by data structure selection, among other considerations. Third, the emphasis is on persistent abstract types, which are indispensable for parallel computing. No more linked lists, no more null pointer nonsense, no more mutation. Instead we consider abstract types of graphs, trees, heaps, and so forth, all with a persistent semantics (operations create “new” ones, rather than modify “old” ones). Fourth, we build on the reasoning methods introduced in the first semester course to prove the correctness and the efficiency of algorithms. Functional programming makes all of this possible. Programs are concise and amenable to proof, they are naturally parallel, and they enjoy powerful typing properties that enforce abstraction in a mathematically rigorous manner. Fifth, there is a strong emphasis on problems of practical interest. For example, homework 1 is the shotgun method for genome sequencing, a parallel algorithm of considerable practical importance and renown.

There is a third introductory course in imperative programming, taken in either the first or second semester (alternating with the functional programming course at the student’s discretion), that teaches the “old ways” of doing things using a “safe” subset of C. Personally, I think this style of programming is obsolete, but there are many good reasons to continue to teach it, the biggest probably being the legacy problem. The emphasis is on verification, using simple assertions that are checked at run-time and which may be verified statically in some situations. It is here that students learn how to do things “the hard way” using low-level representations. This course is very much in the style of the 1970’s era data structures course, the main difference being that the current incarnation of Pascal has curly braces, rather than begin-end.

For the sake of those readers who may not be up to date on such topics, it seems important to emphasize that functional programming subsumes imperative programming. Every functional language is capable of expressing the old-fashioned sequential pointer-hacking implementation of data structures. You can even reproduce Tony Hoare’s self-described “billion dollar mistake” of the cursed “null pointer” if you’d like! But the whole point is that it is rarely useful, and almost never elegant, to work that way. (Curiously, the “monad mania” in the Haskell community stresses an imperative, sequential style of programming that is incompatible with parallelism, but this is not forced on you as it is in the imperative world.) From this point of view there no loss, and considerable gain, in teaching functional programming from the start. It puts a proper emphasis on mathematically sane programming methods, but allows for mutation-based programming where appropriate (for example, in engendering “side effects” on users).

To learn more about these courses, please see the course web pages:

I encourage readers to review the syllabi and course materials. There is quite a large body of material in place that we plan to expand into textbook-level treatments in the near future. We also plan to write a journal article documenting our experiences with these courses.

I am very grateful to my colleagues Guy Blelloch, Dan Licata, and Frank Pfenning for their efforts in helping to develop the new introductory curriculum at Carnegie Mellon, and to the teaching assistants whose hard work and dedication put the ideas into practice.

Update: Unfortunately, the homework assignments for these courses are not publicly available, because we want to minimize the temptation for students to make use of old assignments and solutions in preparing their own work.  I am working with my colleagues to find some way in which we can promote the ideas without sacrificing too much of the integrity of the course materials.  I apologize for the inconvenience.

Some Thoughts on Teaching FP

April 17, 2011

A number of people have asked some very good questions about the details of how we teach certain concepts in our new functional programming class for freshmen at Carnegie Mellon.  Rather than spray responses among the various comments, I’ll summarize a few major points here in hopes of helping others who may wish to teach a similar course.  So this post is not really meant for a broad audience, but rather for the specialist; feel free to skip it if it seems too focused for your interests.  I promise to write some more controversial material of more general interest soon!  Meanwhile, here are a few thoughts presented in no particular order of importance.

Because the class is intended for beginners, we start from scratch with a language-based model of computation.  This means that, with one regrettable misstep on our part, we never talk about extra-linguistic concepts like “run-time stacks” or “compilers.”  The students are taught to think in terms of the language itself, and to reason directly about both the correctness and efficiency of the code they actually write, not the code that it allegedly compiles to or translates to.  One beautiful feature of the language-based approach is that we start with a very familiar model of computation, the evaluation of polynomials over the reals.  It’s very familiar for all students, and I think they find it very satisfying precisely because it has a direct computational flavor.  You plug in for variables and simplify, and out comes the answer.  We can draw on this as our starting point; programs are just generalized polynomials.  In particular, in a functional language variables are variables: the mathematical concept of variables, which is given meaning by substitution, is precisely the programming concept of variable.  It’s not analogous, it’s the same.  So we can draw on their experience and ask them to prove things about programs using methods that build directly on what they already know.  It’s extremely natural, and very beautiful, and leads easily to an integrated view of mathematics and programming.  Moreover, it’s a level playing field.  Students with prior “programming experience” are, if anything, at a disadvantage, because they think they know things that are either wrong or inapplicable.  One consequence is gender equity, because even with American society being what it is, the women have no particular disadvantage with respect to the men when it comes to our style of thinking and programming.  It’s a win-win-win situation.

Moving to a more technical level, the use of structural operational semantics is indispensable for providing a rigorous foundation for understanding program execution, reasoning about program correctness, and for defining a cost model to support reasoning about asymptotic complexity.  There is no substitute for this!  Without a crisp formulation of the semantics of the language, it is impossible to discuss any of these issues in a meaningful and technically precise way.  With it you can readily resolve any questions about “what happens if …”, giving the students a tool that they can use themselves to answer such questions.  Moreover, as program verification becomes more important in industrial practice, as well as academic research, it is essential that students be educated in the tools of semantics.  Structural operational semantics is very easy to teach, and presents no problems for the students.  We just use it, and they get it without any fuss or bother.  It is a natural extension of their experience with high school algebra.  Be not afraid of using these tools to teach programming!

As I’ve explained previously, it is a very good idea to avoid Booleans as much as possible.  And, above all, don’t mention equality!  The equals sign in programming languages is not the equals sign of mathematics.  Propositions are not Booleans, and it only confuses matters to use notations that encourage this misconception.  Related to this, avoid if-then-else entirely, and instead use only case analysis for branching, even when the value to be discriminated is a Boolean.  We consistently write things like

case,y) of
  LESS => ...
| GREATER => ...
| EQUAL => ...

rather than a nested conditional branch.  It encourages students to think in terms of pattern matching, and prepares the ground for later developments, including a smooth transition to pattern matching over more complex data structures and reasoning inductively when programming recursively.

Teaching parallelism is completely straightforward, because the model of computation inherently avoids unnatural and unnecessary problems of interference, and focuses attention on the critical issue of data dependencies among computations in a program.  Students have no trouble computing the work (sequential time complexity) or span (parallel time complexity) of a program, and have no problems reading off recurrences for the respective time complexities.  Later, when we introduce sequences, the idea of computing in parallel with the entire sequence, rather than item-by-iterm (as encouraged by the dreadful iterators so beloved in the oo world), comes naturally and easily.  The key to this, of course, is that data structures in a functional language are naturally persistent; it is monstrously hard to use ephemeral data structures in a parallel computation, and is not something we could hope to teach freshmen.

A major decision for us is how to teach the expression and enforcement of abstraction in a program.  In a departure from our previous approach, we have decided against using opaque ascription (sealing) as a means of enforcing abstraction.  It has its virtues, but the problem is that it does not mesh well with other language features, in particular with substructures and type classes (views).  For example, consider the signature of a mapping whose domain is an ordered type of keys:

signature MAPPING = sig
  structure Key : ORDERED
  type 'a mapping
  val lookup : Key.t -> 'a mapping -> 'a

Unfortunately, sealing a structure with this signature renders the module useless:

structure IntMapping :> MAPPING = struct
 structure Key = Int
 type 'a mapping = 'a bst

The trouble is that not only is the type ‘a IntMapping.mapping abstract, as intended, but so is IntMapping.Key.t, which is not at all intended!  To get around this we we must create a specialization of the signature MAPPING using one of several means such as

signature INT_MAPPING = MAPPING where type Key.t=int
structure IntMapping :> INT_MAPPING = ...

Of course one need not name the signature, but this illustrates the general problem.  As things get more complicated, you have more and more clauses that specify the types of things (sharing specifications).

The alternative, which has worked very well for us, is to eschew opaque ascription, and instead rely on the datatype mechanism to make types abstract.  So to give an implementation of the abstract type of mappings with keys being integers, we proceed as follows:

structure IntMapping : MAPPING = struct
  structure Key : ORDERED = Int
  datatype 'a bst = Empty | Node of 'a bst * (Key.t * 'a) * 'a bst
  type 'a mapping = 'a bst
  val insert = ...

The point is that since the constructors of the type ‘a bst are not exported in the interface, the type ‘a IntMapping.mapping is abstract.  Note as well that the use of transparent ascription on the structure Key ensures that keys really are integers (of type, and are not abstract, exactly as intended.  This formulation allows us to state simple rules of signature matching (every specification in the signature has a corresponding declaration in the structure), and allows us to enforce abstraction boundaries with a minimum of fuss.  The students have had absolutely no trouble with this at all, and we have had no trouble structuring our code this way.

When using functors (parameterized modules) to combine modules it is, of course, necessary to impose sharing constraints to ensure that only coherent compositions are possible.  (Rather than take the space to explain this here, I will rely on the reader’s experience to understand what is involved here.)  These sorts of sharing specifications are perfectly natural, easily explained, and have presented absolutely no difficulties for the students.  We illustrated their use in our game tree search example, in which the “referee” module is parameterized by the two “player” modules, which must of course cohere on their concept of a “game” (it’s no use pitting a chess player against a checkers player!).  The code looks like this

functor Referee
  (structure Player1 : PLAYER and Player2 : PLAYER
   sharing type Player1.Game.t = Player2.Game.t) : REFEREE = ...

The sharing specification states precisely and concisely the natural coherence constraint governing the two players.  Here again, the dog we feared never barked, and the students found it all quite intuitive and unproblematic.  This allowed them to expend their time on the actual complexities of the problem at hand, such as how to think about alpha-beta pruning in a parallel game-tree search, rather than get all tied up with the bureaucracy of structuring the code itself.

The virtue of teaching bright young students is that they are bright and they are young.  Their brilliance is, of course, a pleasure.  We have to work hard to come up with sufficiently challenging exercises, and many students challenge us with their solutions to our problems.  Their youth means that they come to us with a minimum of misconceptions and misinformation that they’ve picked up on the street, and are open to learning methods that are entirely non-standard (at least for now) with respect to what their friends are learning at other universities.  What makes Carnegie Mellon a special place is precisely that the students are pushed into thinking hard in ways that they might not be.  Personally, I hope that more universities worldwide build on what we have started, and give their students the same benefits that ours are already enjoying.

A Dead Dog

April 12, 2011

In an earlier post I wrote about our approach to teaching students to reason inductively when programming recursively.  This generated some discussion, because teaching induction and recursion is notoriously difficult, to the point that many of my colleagues have all but given up hope of ever getting these ideas across to the majority of their students.  I’ve already explained our methods, and suggested some reasons why we seem to have succeeded where others have not.

We’re nearing the end of semester, and I must say that I am so proud of our students, I just have to brag a bit.  First, some background.  As I’ve mentioned, I am co-teaching a new course in functional programming with Dan Licata this semester as part of a thorough revamping of introductory CS that places a strong emphasis on reasoning about the correctness and efficiency of programs in both a sequential and parallel setting.  Dan and I have worked closely with an amazing team of teaching assistants in developing the new curriculum and course materials.  For the sake of continuity, Dan does all the lecturing, and I sit in the audience, sometimes kibbitzing a bit, but mostly just paying attention to the signals that students are sending and thinking about what we could do better next time.  Despite it being a moderately large class of about 83 students, we have managed to maintain an interactive atmosphere in which students freely ask, and respond, to questions, sometimes working in pairs for a few minutes on their own during the lecture.  This has worked very well for everyone, creating a conducive environment for learning to write beautiful code.

Today’s lecture was about persistent and ephemeral data structures.  One goal was simply to draw the distinction, and show how it is expressed in the interface of an abstract type; it is good to have some terminology available for comparing functional and imperative programming styles.  Another goal was to point out the importance of persistence for parallelism, since a parallel computation inherently requires that a data structure have “multiple futures”, rather than the “single future” provided by the ephemeral case.  (Put another way, the exclusive consideration of ephemeral representations in the usual data structures course taught worldwide looks more and more misguided as time goes on.  Indeed, our planned new approach to teaching data structures reverses the emphasis.  This is one reason why I consider object-oriented languages to be unsuitable for a modern introductory curriculum, conventional wisdom notwithstanding.)  A third goal was to re-acquaint the students with imperative programming so that they could make direct comparison with what they have been learning all semester, and recognize the limitations of imperative methods when it comes to parallelism and distribution.  Simply put, imperative methods are useful, if at all, in the “corner case” of sequential execution of programs acting on ephemeral data structures; in all other cases what you want are functional (transformational) methods.

But enough about context.  What made me feel so proud today was the sophistication displayed by our students during today’s interactive program development.  The exercise was to implement binary search tree delete in the persistent (transformational) style, and in the ephemeral (imperative) style in two different ways.  Dan set up the problem, and asked the students to help him write the code.  What impressed me so deeply was that as I listened to the murmur in the classroom, and to the proposals offered aloud, I could hear students saying things like

Well, inductively, we can assume that the problem has been solved for the left subtree, so to continue we need only ….

Outstanding!  (And remember, these are freshmen!)  Just as I had hoped, for these students thinking inductively comes naturally as the obvious way to think about a piece of recursive code!  They know instinctively how to formulate an inductive argument for the correctness of program, rather than resort to error-prone hand-waving about “going around and around the loop” that is all too typical at this level.  What makes this possible is that functional programs are inherently mathematical, allowing the students to concentrate on the ideas, rather than on the intricacies of coding in less expressive imperative languages.  I think that this gives them the confidence to tackle some quite difficult problems, such as the Barnes-Hut n-body algorithm or Jamboree game-search algorithm, for a one-week homework assignment.  Material that used to be complicated, or deferred to more advanced courses, can be done readily with beginners if you have the right framework in which to express the ideas.

As I walked back to my office, a group of students on the elevator with me groaned about the pain of having to do imperative programming again (most had learned the old ways last semester), but cheered me up by saying that doing so made a good case for why functional methods are better.  What more could one ask?

Parallelism Is Not Concurrency

March 17, 2011

In an earlier post I mentioned that one goal of the new introductory curriculum at Carnegie Mellon is to teach parallelism as the general case of computing, rather than an esoteric, specialized subject for advanced students.  Many people are incredulous when I tell them this, because it immediately conjures in their mind the myriad complexities of concurrency: locks, monitors, deadlock, dining philosophers, ….  And if you look at the recent research literature, you will see countless papers of the form (a) parallelism is important, so (b) let’s talk about concurrency.  The situation has gotten so bad that I think that most computer scientists cannot distinguish the two concepts.  (Or maybe they willfully confuse the two so as to justify funding for work on concurrency by appealing to the importance of parallelism.)

Given this state of affairs, I cannot explain what we are doing at Carnegie Mellon (to anyone who doesn’t already know) without first rebooting you.  Trust me, I’m just going to hit Ctrl-Alt-Delete, and then click Restart.  OK, we’re back.

The first thing to understand is parallelism has nothing to do with concurrency.  Concurrency is concerned with nondeterministic composition of programs (or their components).  Parallelism is concerned with asymptotic efficiency of programs with deterministic behavior.  Concurrency is all about managing the unmanageable: events arrive for reasons beyond our control, and we must respond to them.  A user clicks a mouse, the window manager must respond, even though the display is demanding attention.  Such situations are inherently nondeterministic, but we also employ pro forma nondeterminism in a deterministic setting by pretending that components signal events in an arbitrary order, and that we must respond to them as they arise.  Nondeterministic composition is a powerful program structuring idea.  Parallelism, on the other hand, is all about dependencies among the subcomputations of a deterministic computation.  The result is not in doubt, but there are many means of achieving it, some more efficient than others.  We wish to exploit those opportunities to our advantage.

Now I can hear you object, but isn’t concurrency required to implement parallelism?  Well, yes, it is, but concurrency is also required to implement sequentiality too!  The timing signal on your processor chip is essentially a synchronization mechanism with which to coordinate the otherwise independent activity of the components of the processor.  (Yes, I know that there are asynchronous processors, but this only makes my point stronger, you still need to control the concurrency.)  The point is that concurrency is not relevant to parallelism, even if the engineers who build our parallel computing platforms must deal with concurrency.  Another way to say the same thing is that parallelism is a useful abstraction, and abstractions should never be confused with their implementations.  (Numbers are not certain sets, but they can be implemented in terms of sets.  And so on.)

All well and good, parallelism is not concurrency, but what is it?  What is an appropriate model for parallel computation?  As in the sequential case, there are two answers, a machine-based approach and a language-based approach.  I am going to criticize the machine-based formulation, and argue for the language-based formulation on two grounds: (a) it is inherently more elegant and natural, and (b) it can be taught directly to students without any fuss or bother.  Much of what I have to say is derived from the seminal and beautiful work of Guy Blelloch, which I have absorbed over years of collaboration.  To the extent I’ve gotten it right, give Guy the credit; to the extent that I’ve gotten it wrong, assign me the blame.

The canonical machine-based model is the PRAM, or parallel RAM, which generalizes the classical RAM model to allow p>0 processors that share a common memory.  There are many variations on the PRAM, according to what sorts of interactions are allowed and how conflicts are resolved among the processors.  This is not important for present purposes, so I will not attempt to categorize the variations here.  The key idea is that the number, p, of processors is fixed in advance—results are stated in terms of how efficiently an algorithm executes on a p-processor PRAM as a function of the size, n, of the input.  This is well and good as a cost model for parallel execution, but it totally sucks as a programming model!  To write code for a PRAM, you fix the number of processors, and write the code for that number of processors (often the various cases are “analogous”, but in reality you must fix p and write code for that fixed p).  Moreover, you must (officially, anyway) write low-level imperative RAM code, which introduces spurious dependencies among sub-computations by explicitly scheduling the steps of the computation.  This is horrendous!  Of course typically one writes pidgin C, and sketches how it is compiled, but the fact is you’re still writing sequential, imperative code to implement a parallel algorithm.

As I’ve argued previously, what is needed is a language-based model of computation in which we assign costs to the steps of the program we actually write, not the one it (allegedly) compiles into.  Moreover, in the parallel setting we wish to think in terms of dependencies among computations, rather than the exact order in which they are to be executed.  This allows us to factor out the properties of the target platform, such as the number, p, of processing units available, and instead write the program in an intrinsically parallel manner, and let the compiler and run-time system (that is, the semantics of the language) sort out how to schedule it onto a parallel fabric.  Here’s a useful analogy.  Just as abstract languages allow us to think in terms of data structures such as trees or lists as forms of value and not bother about how to “schedule” the data structure into a sequence of words in memory, so a parallel language should allow us to think in terms of the dependencies among the phases of a large computation and not bother about how to schedule the workload onto processors.  Storage management is to abstract values as scheduling is to deterministic parallelism.

The paradigmatic example is provided by Quicksort, a naturally parallel algorithm.  Many people think that the “big idea” of Quicksort is the in-place swapping technique that Hoare invented in the original formulation.  I argue that this is a detail; the “big idea” is that it is parallelizable.  To see this, we must work at a reasonable level of abstraction: a sort function permutes a non-empty sequence of (comparable) elements to obtain another sequence in sorted order.

fun qs xs = if |xs|=1 then xs else let val (xsl, xsg) = split xs in concat (qs xsl, qs xsg)

That is, we split the sequence into two parts (by some means) such that everything in the left part is less than everything in the right part, then sort the parts independently (in parallel!), and merge the results together.  The exact complexity of Quicksort depends on the representation of sequences, and the complexity of the operations on them.  For now, let us just gloss over this to focus on the overall structure of the computation.  If we can ensure that the sequence is split roughly in half (by any of a number of means, including randomization), then the recursion is of logarithmic depth.  At each stage we do linear work, but the recursive calls are independent of one another.  The algorithm has a natural fork-join structure in which we (a) split the sequence into two parts, (b) in parallel sort the two parts, and (c) join the results together.  (The splitting itself can be done in parallel as well, but we’ll not bother about that here.)

Analysis of Quicksort splits into two parts, the work and the depth (or span).  The work is the overall number of steps required to perform the algorithm.  On a sequence of length n, we perform O(n log n) steps to sort the sequence, as might have been expected.  This is the sequential complexity, the amount of time it would take to run on a single processor.  To account for parallelism, there is a second measure, called the depth, that measures the critical path length of the computation, the length of the longest chain of dependencies among subcomputation.  No matter how much parallelism we may have available, we can never run faster than the depth; it represents the idealized parallel complexity of the algorithm, the amount of time it would take given any number of processors.  Depending on the choice of data structures, the depth of Quicksort is O(log^2 n), or thereabouts, which makes it quite parallelizable.  (Note that, as in our earlier discussion, a step of computation is a transition step in the operational semantics of the language we are using to express the algorithm.)

This is very nice and elegant, but now comes the really beautiful part, the concept of a provable implementation of the language.  (The word “provable” is only needed here because of the shabby state of the field; having a proof that your implementation works, and how well it works, remains an oddity to be emphasized, rather than an expectation to be fulfilled by any self-respecting computer scientist.)  The idea is that the language-based model can be implemented on a PRAM (with stated interconnect properties that I will suppress here) with a provable bound on the cost that takes account of the overhead of scheduling onto a p-processor fabric.  Thus, p need not be specified in advance, but is rather determined by the run-time system and compiler, which are provided by the proof of the theorem.

Theorem A computation with work w and depth d can be implemented in a p-processor PRAM in time O(max(w/p, d)).

(Sometimes the second factor is d log p, depending on the assumptions we are making about the target PRAM.)

This is called Brent’s Principle, and it states a fairly obvious fact.  First, we can never execute a depth d program in fewer than d steps.  The depth is the critical path length; this is the amount of time we must wait to execute the algorithm, regardless of any available parallelism.  Second, to the extent that parallelism arises from the dependency (rather, indepedency) structure of the computation, we can at best do p units of work at time, keeping all processors warm, until we are done.  And that’s it!

The effectiveness of the language-based model of parallelism lies entirely in its ability to expose the dependency structure of the computation by not introducing any dependencies that are not forced on us by the nature of the computation itself.  And the key to this is functional programming, which manifests itself here in the transformational approach to computation: sorting is conceived of as a mathematical function that transforms a given sequence into another sequence.  It does not destroy the given sequence any more than adding two numbers destroys those numbers!  Since Quicksort is a mathematical function, we need not worry that execution of qs xsl interferes with (depends on) qs xsg; we can readily run them in parallel without fear of untoward consequences.  The payoff is that there are many fewer dependencies among the subcomputations, and hence many more opportunities for parallelism that can be exploited, in accord with Brent’s Principle, when scheduling the work onto a parallel fabric.

The upshot of all this is that functional programming is of paramount importance for parallelism.  That’s (one reason) why we are teaching functional programming to freshmen at Carnegie Mellon.  And it’s the main reason why we are able to.  There’s nothing to it, just program like you do math, and the parallelism will take care of itself!

Update: Simon Marlowe made similar points to mine some time ago.