In an earlier post I mentioned that one goal of the new introductory curriculum at Carnegie Mellon is to teach parallelism as the general case of computing, rather than an esoteric, specialized subject for advanced students. Many people are incredulous when I tell them this, because it immediately conjures in their mind the myriad complexities of concurrency: locks, monitors, deadlock, dining philosophers, …. And if you look at the recent research literature, you will see countless papers of the form (a) parallelism is important, so (b) let’s talk about concurrency. The situation has gotten so bad that I think that most computer scientists cannot distinguish the two concepts. (Or maybe they willfully confuse the two so as to justify funding for work on concurrency by appealing to the importance of parallelism.)
Given this state of affairs, I cannot explain what we are doing at Carnegie Mellon (to anyone who doesn’t already know) without first rebooting you. Trust me, I’m just going to hit Ctrl-Alt-Delete, and then click Restart. OK, we’re back.
The first thing to understand is parallelism has nothing to do with concurrency. Concurrency is concerned with nondeterministic composition of programs (or their components). Parallelism is concerned with asymptotic efficiency of programs with deterministic behavior. Concurrency is all about managing the unmanageable: events arrive for reasons beyond our control, and we must respond to them. A user clicks a mouse, the window manager must respond, even though the display is demanding attention. Such situations are inherently nondeterministic, but we also employ pro forma nondeterminism in a deterministic setting by pretending that components signal events in an arbitrary order, and that we must respond to them as they arise. Nondeterministic composition is a powerful program structuring idea. Parallelism, on the other hand, is all about dependencies among the subcomputations of a deterministic computation. The result is not in doubt, but there are many means of achieving it, some more efficient than others. We wish to exploit those opportunities to our advantage.
Now I can hear you object, but isn’t concurrency required to implement parallelism? Well, yes, it is, but concurrency is also required to implement sequentiality too! The timing signal on your processor chip is essentially a synchronization mechanism with which to coordinate the otherwise independent activity of the components of the processor. (Yes, I know that there are asynchronous processors, but this only makes my point stronger, you still need to control the concurrency.) The point is that concurrency is not relevant to parallelism, even if the engineers who build our parallel computing platforms must deal with concurrency. Another way to say the same thing is that parallelism is a useful abstraction, and abstractions should never be confused with their implementations. (Numbers are not certain sets, but they can be implemented in terms of sets. And so on.)
All well and good, parallelism is not concurrency, but what is it? What is an appropriate model for parallel computation? As in the sequential case, there are two answers, a machine-based approach and a language-based approach. I am going to criticize the machine-based formulation, and argue for the language-based formulation on two grounds: (a) it is inherently more elegant and natural, and (b) it can be taught directly to students without any fuss or bother. Much of what I have to say is derived from the seminal and beautiful work of Guy Blelloch, which I have absorbed over years of collaboration. To the extent I’ve gotten it right, give Guy the credit; to the extent that I’ve gotten it wrong, assign me the blame.
The canonical machine-based model is the PRAM, or parallel RAM, which generalizes the classical RAM model to allow p>0 processors that share a common memory. There are many variations on the PRAM, according to what sorts of interactions are allowed and how conflicts are resolved among the processors. This is not important for present purposes, so I will not attempt to categorize the variations here. The key idea is that the number, p, of processors is fixed in advance—results are stated in terms of how efficiently an algorithm executes on a p-processor PRAM as a function of the size, n, of the input. This is well and good as a cost model for parallel execution, but it totally sucks as a programming model! To write code for a PRAM, you fix the number of processors, and write the code for that number of processors (often the various cases are “analogous”, but in reality you must fix p and write code for that fixed p). Moreover, you must (officially, anyway) write low-level imperative RAM code, which introduces spurious dependencies among sub-computations by explicitly scheduling the steps of the computation. This is horrendous! Of course typically one writes pidgin C, and sketches how it is compiled, but the fact is you’re still writing sequential, imperative code to implement a parallel algorithm.
As I’ve argued previously, what is needed is a language-based model of computation in which we assign costs to the steps of the program we actually write, not the one it (allegedly) compiles into. Moreover, in the parallel setting we wish to think in terms of dependencies among computations, rather than the exact order in which they are to be executed. This allows us to factor out the properties of the target platform, such as the number, p, of processing units available, and instead write the program in an intrinsically parallel manner, and let the compiler and run-time system (that is, the semantics of the language) sort out how to schedule it onto a parallel fabric. Here’s a useful analogy. Just as abstract languages allow us to think in terms of data structures such as trees or lists as forms of value and not bother about how to “schedule” the data structure into a sequence of words in memory, so a parallel language should allow us to think in terms of the dependencies among the phases of a large computation and not bother about how to schedule the workload onto processors. Storage management is to abstract values as scheduling is to deterministic parallelism.
The paradigmatic example is provided by Quicksort, a naturally parallel algorithm. Many people think that the “big idea” of Quicksort is the in-place swapping technique that Hoare invented in the original formulation. I argue that this is a detail; the “big idea” is that it is parallelizable. To see this, we must work at a reasonable level of abstraction: a sort function permutes a non-empty sequence of (comparable) elements to obtain another sequence in sorted order.
fun qs xs = if |xs|=1 then xs else let val (xsl, xsg) = split xs in concat (qs xsl, qs xsg)
That is, we split the sequence into two parts (by some means) such that everything in the left part is less than everything in the right part, then sort the parts independently (in parallel!), and merge the results together. The exact complexity of Quicksort depends on the representation of sequences, and the complexity of the operations on them. For now, let us just gloss over this to focus on the overall structure of the computation. If we can ensure that the sequence is split roughly in half (by any of a number of means, including randomization), then the recursion is of logarithmic depth. At each stage we do linear work, but the recursive calls are independent of one another. The algorithm has a natural fork-join structure in which we (a) split the sequence into two parts, (b) in parallel sort the two parts, and (c) join the results together. (The splitting itself can be done in parallel as well, but we’ll not bother about that here.)
Analysis of Quicksort splits into two parts, the work and the depth (or span). The work is the overall number of steps required to perform the algorithm. On a sequence of length n, we perform O(n log n) steps to sort the sequence, as might have been expected. This is the sequential complexity, the amount of time it would take to run on a single processor. To account for parallelism, there is a second measure, called the depth, that measures the critical path length of the computation, the length of the longest chain of dependencies among subcomputation. No matter how much parallelism we may have available, we can never run faster than the depth; it represents the idealized parallel complexity of the algorithm, the amount of time it would take given any number of processors. Depending on the choice of data structures, the depth of Quicksort is O(log^2 n), or thereabouts, which makes it quite parallelizable. (Note that, as in our earlier discussion, a step of computation is a transition step in the operational semantics of the language we are using to express the algorithm.)
This is very nice and elegant, but now comes the really beautiful part, the concept of a provable implementation of the language. (The word “provable” is only needed here because of the shabby state of the field; having a proof that your implementation works, and how well it works, remains an oddity to be emphasized, rather than an expectation to be fulfilled by any self-respecting computer scientist.) The idea is that the language-based model can be implemented on a PRAM (with stated interconnect properties that I will suppress here) with a provable bound on the cost that takes account of the overhead of scheduling onto a p-processor fabric. Thus, p need not be specified in advance, but is rather determined by the run-time system and compiler, which are provided by the proof of the theorem.
Theorem A computation with work w and depth d can be implemented in a p-processor PRAM in time O(max(w/p, d)).
(Sometimes the second factor is d log p, depending on the assumptions we are making about the target PRAM.)
This is called Brent’s Principle, and it states a fairly obvious fact. First, we can never execute a depth d program in fewer than d steps. The depth is the critical path length; this is the amount of time we must wait to execute the algorithm, regardless of any available parallelism. Second, to the extent that parallelism arises from the dependency (rather, indepedency) structure of the computation, we can at best do p units of work at time, keeping all processors warm, until we are done. And that’s it!
The effectiveness of the language-based model of parallelism lies entirely in its ability to expose the dependency structure of the computation by not introducing any dependencies that are not forced on us by the nature of the computation itself. And the key to this is functional programming, which manifests itself here in the transformational approach to computation: sorting is conceived of as a mathematical function that transforms a given sequence into another sequence. It does not destroy the given sequence any more than adding two numbers destroys those numbers! Since Quicksort is a mathematical function, we need not worry that execution of qs xsl interferes with (depends on) qs xsg; we can readily run them in parallel without fear of untoward consequences. The payoff is that there are many fewer dependencies among the subcomputations, and hence many more opportunities for parallelism that can be exploited, in accord with Brent’s Principle, when scheduling the work onto a parallel fabric.
The upshot of all this is that functional programming is of paramount importance for parallelism. That’s (one reason) why we are teaching functional programming to freshmen at Carnegie Mellon. And it’s the main reason why we are able to. There’s nothing to it, just program like you do math, and the parallelism will take care of itself!
Update: Simon Marlowe made similar points to mine some time ago.